The Tradeoffs That Define Every AI Product Decision

Senior Product Manager writing about two sides of AI: building AI products that work at scale, and using AI to work more effectively as a PM. I share frameworks for Applied AI product management—economics, evaluation, agent design, responsible deployment—alongside practical guides for AI-powered productivity, workflows, and decision-making. If you're building AI products or figuring out how to leverage AI in your PM workflow (or both), this is for you. Currently based in Seattle.

Part 8 of the Applied AI Product Management series. Posts 6 and 7 covered architecture and infrastructure. This post covers something harder: the decisions that don't have clean answers, where every choice costs something, and where the teams that do well are the ones who make those choices consciously rather than by default.


In 2019, a major social platform optimized its recommendation algorithm for watch time. The model got very good at maximizing it. Engagement numbers looked strong for eighteen months. Then user satisfaction surveys started declining. Then retention. Then a congressional hearing.

Nobody made a bad decision. They made an implicit one. They optimized hard for what they could measure without asking what they were giving up in the process. By the time the tradeoff became visible, it had been running for a year and a half.

That's the pattern this post is about. Not the tradeoffs teams make consciously and get wrong. The ones they make without knowing they're choosing at all.


False positives vs. false negatives

This is the one I see PM teams get wrong most consistently, and it has the most direct consequences for users.

A false positive is when the model fires when it shouldn't. A spam filter marking a real email. A fraud system blocking a legitimate transaction. Grammarly underlining something that's actually correct. A false negative is when the model stays quiet when it should have acted. Spam landing in the inbox. Fraud going through. A real error getting missed.

Every model produces both. The threshold you set determines the ratio. Raise the threshold so the model fires less often and you get fewer false positives but more false negatives. Lower it and the ratio flips. For a given model, you cannot minimize both simultaneously. That's not an engineering limitation. It's math.
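To make the threshold effect concrete, here is a minimal sketch that sweeps a threshold over a handful of scored examples and counts each error type. The scores and labels are invented for illustration, and `confusion_counts` is a hypothetical helper, not part of any library.

```python
def confusion_counts(scores, labels, threshold):
    """Count false positives and false negatives at a given threshold.

    A score >= threshold means the model "fires" (predicts positive).
    """
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return fp, fn


# Invented model scores and true labels (1 = spam, 0 = legitimate).
scores = [0.05, 0.20, 0.35, 0.40, 0.55, 0.60, 0.75, 0.80, 0.90, 0.95]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

for threshold in (0.3, 0.5, 0.7, 0.9):
    fp, fn = confusion_counts(scores, labels, threshold)
    print(f"threshold={threshold:.1f}  false_positives={fp}  false_negatives={fn}")
```

On this toy data, each step up in the threshold removes one false positive and adds one false negative. Real data won't be that tidy, but the direction of the trade is the same.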

What most teams don't do is ask which error type costs more. And the answer varies enormously depending on what you're building.

Three products that look similar but have completely opposite answers:

A spam filter's catastrophic failure is a false positive. A real email from a client or a doctor disappearing silently into spam destroys trust in a way that's often unrecoverable. Missing some spam is annoying. Missing an important message is a different category of problem entirely. So you optimize for precision, you let some spam through, and you accept that tradeoff consciously.

Fraud detection flips this. A false positive, a declined legitimate transaction, frustrates a customer who will probably sort it out. A false negative, fraud going through, costs real money and creates liability. If it happens enough it undermines the product's reason to exist. So you optimize for recall, even at the cost of some false alarms.

Grammarly is the most interesting case. Their catastrophic failure is also a false positive, but for a different reason. Too many wrong suggestions and users stop reading any of them. This is called suggestion fatigue. Once that trust breaks, even the corrections that are right get ignored. So Grammarly deliberately misses real errors rather than risk the false positives that would make the entire product invisible to users who've stopped paying attention.

Before any model ships, the conversation worth having is simple: if this model is going to make 100 errors, how many of them would we rather be false positives, and how many false negatives? Most teams have never had that conversation explicitly. The threshold is sitting at whatever default the library shipped with.
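One way to turn that conversation into a number is to put a rough cost on each error type and pick the threshold that minimizes total cost on labeled data. The sketch below assumes you can estimate those costs at all, which is often the hard part. The data, the cost values, and the function names are all hypothetical.

```python
def total_error_cost(scores, labels, threshold, cost_fp, cost_fn):
    """Total cost of errors at a threshold, given a cost per error type."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return fp * cost_fp + fn * cost_fn


def cheapest_threshold(scores, labels, candidates, cost_fp, cost_fn):
    """Return the candidate threshold with the lowest total error cost."""
    return min(
        candidates,
        key=lambda t: total_error_cost(scores, labels, t, cost_fp, cost_fn),
    )


# Invented validation data (1 = should fire, 0 = should stay quiet).
scores = [0.05, 0.20, 0.35, 0.40, 0.55, 0.60, 0.75, 0.80, 0.90, 0.95]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
candidates = [0.3, 0.5, 0.7, 0.9]

# Spam-filter-like costs: a false positive hurts ten times more.
print(cheapest_threshold(scores, labels, candidates, cost_fp=10, cost_fn=1))
# Fraud-like costs: a false negative hurts ten times more.
print(cheapest_threshold(scores, labels, candidates, cost_fp=1, cost_fn=10))
```

Same model, same data, opposite thresholds. The only thing that changed is which error the team decided was more expensive.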

One more thing worth knowing. Different user segments often have completely different tolerances. Grammarly's premium users accept more suggestions than free users. Stripe uses different fraud thresholds for transactions above and below $1,000. The cost of a false negative scales with transaction size while the cost of a false positive stays roughly constant. Segmented thresholds are more work. They're almost always worth it.
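The transaction-size argument can be written down directly. This is a sketch under strong assumptions, not how any real payments company does it: the model outputs a calibrated fraud probability, a missed fraud costs the full transaction amount, and a false decline costs a fixed amount. Blocking is worth it when p × amount > (1 − p) × false_decline_cost, which rearranges to the threshold below. The function name and the $25 figure are hypothetical.

```python
def block_threshold(amount, false_decline_cost):
    """Fraud probability above which blocking has lower expected cost.

    Derived from: block when p * amount > (1 - p) * false_decline_cost.
    Assumes calibrated probabilities and that a missed fraud costs `amount`.
    """
    if amount <= 0 or false_decline_cost <= 0:
        raise ValueError("amount and false_decline_cost must be positive")
    return false_decline_cost / (false_decline_cost + amount)


FALSE_DECLINE_COST = 25.0  # hypothetical fixed cost of annoying a good customer

for amount in (100, 1_000, 5_000):
    threshold = block_threshold(amount, FALSE_DECLINE_COST)
    print(f"${amount:>5}: block if fraud probability > {threshold:.4f}")
```

The threshold falls as the amount rises, which is the segmented-threshold idea in one line: the bigger the potential loss, the less evidence you need before stepping in.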


Accuracy, latency, and cost

You can optimize two of these. Not three. This isn't a temporary state that better engineering will eventually resolve. It's a structural constraint, and the teams that navigate it well pick their concession deliberately rather than discovering it later when the bill arrives or users start complaining.

ChatGPT protects accuracy and latency. Users won't tolerate slow responses or wrong answers, and the pricing model supports the infrastructure cost. The concession is cost. It's an expensive product to run.

Spotify Discover Weekly protects accuracy and cost. Batch generation every Monday means users wait a week for updated recommendations. That's a real latency concession. But the quality justifies the wait and the economics work at scale.

Gmail Smart Compose protects latency and cost. Suggestions need to appear in under 100 milliseconds on hundreds of millions of devices. A small on-device model with a lower accuracy ceiling is the only architecture that gets you there. The concession is that the suggestions are less sophisticated than a cloud model would produce.

Each product made a different concession. None of them made it randomly. Each one figured out which constraint their users would notice least and gave that one up.

The cost modeling most teams skip: run the numbers at 10x your current scale before you commit to an architecture. A model costing $0.03 per query sounds manageable until you're at 500,000 queries per day and the monthly bill is $450,000. Optimization strategies like model tiering, response caching, and prompt compression are much harder to retrofit than to build for from the start.
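The arithmetic is simple enough to keep in a script you rerun whenever the architecture changes. A minimal sketch: the $0.03 per query comes from the example above, while the current volume of 50,000 queries per day, the 30-day month, and the 40% cache hit rate are assumptions for illustration.

```python
def monthly_cost(cost_per_query, queries_per_day, days_per_month=30, cache_hit_rate=0.0):
    """Projected monthly model spend in dollars.

    cache_hit_rate is the fraction of queries answered from a response cache.
    This sketch assumes a cache hit costs nothing, which is optimistic.
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    billable_queries = queries_per_day * (1.0 - cache_hit_rate)
    return cost_per_query * billable_queries * days_per_month


COST_PER_QUERY = 0.03      # dollars, from the example above
CURRENT_QUERIES = 50_000   # hypothetical current daily volume

for label, multiplier in (("today", 1), ("at 10x", 10)):
    cost = monthly_cost(COST_PER_QUERY, CURRENT_QUERIES * multiplier)
    print(f"{label}: ${cost:,.0f} per month")

# Same 10x volume, assuming (hypothetically) 40% of queries hit a cache.
cached = monthly_cost(COST_PER_QUERY, CURRENT_QUERIES * 10, cache_hit_rate=0.4)
print(f"at 10x with caching: ${cached:,.0f} per month")
```

The point of the last line is to show how much an optimization would be worth before you decide whether to design for it up front.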

Figure out your load-bearing constraint first. If users abandon when the product is slow, protect latency above everything else. If errors create liability or destroy trust, protect accuracy. If the unit economics only work at massive scale, cost is your constraint to defend. Then design the other two around the one you've chosen.


Model complexity vs. interpretability

More capable models are almost always less interpretable. A logistic regression makes decisions you can explain in plain English. A deep learning model makes decisions through billions of parameters that its creators can't fully account for even if they wanted to.

For most consumer AI features (recommendations, search ranking, content generation), this is fine. Users don't need to know why Spotify suggested a song. They need the song to be good.

For a growing set of use cases it isn't fine, and the consequences of getting this wrong are real.

Credit decisions require explanation in many jurisdictions. A lender using a black-box model with no explanation layer cannot satisfy the legal requirement to tell the applicant why a loan was denied. Medical AI used to guide clinical decisions needs to show physicians its reasoning, not because doctors distrust AI but because they need to integrate it with their own judgment, which requires understanding what the model responded to. Hiring tools that show disparate impact across protected groups need to be auditable.

The practical toolkit when interpretability matters: linear models and decision trees when you need full transparency and the relationship between inputs and outputs is relatively simple. Gradient boosting when you need higher accuracy on structured data and feature importance is a sufficient proxy for explanation. Post-hoc methods like LIME and SHAP when you need to deploy a complex model but surface per-decision explanations. A separate explanation model trained to approximate the complex model's decisions in interpretable terms when the complex model's accuracy is genuinely non-negotiable.
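To show what "full transparency" buys you at the simple end of that toolkit, here is a sketch of a logistic-regression-style score where every decision decomposes into per-feature contributions. The feature names, weights, and bias are invented for illustration and are not from any real credit model.

```python
import math

# Hypothetical weights for a toy approval model (log-odds per unit of feature).
WEIGHTS = {
    "debt_to_income": -3.0,
    "years_of_credit_history": 0.25,
    "recent_missed_payments": -0.8,
}
BIAS = 0.5


def explain(applicant):
    """Return approval probability, per-feature contributions, and reasons.

    Reasons are the features that pushed the score down, worst first.
    """
    contributions = {name: WEIGHTS[name] * applicant[name] for name in WEIGHTS}
    log_odds = BIAS + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-log_odds))
    reasons = sorted(
        (name for name, value in contributions.items() if value < 0),
        key=lambda name: contributions[name],
    )
    return probability, contributions, reasons


applicant = {
    "debt_to_income": 0.6,
    "years_of_credit_history": 2,
    "recent_missed_payments": 3,
}
probability, contributions, reasons = explain(applicant)
print(f"approval probability: {probability:.3f}")
print("main reasons against:", reasons)
```

Because the score is a plain sum, the explanation is exact rather than approximate. Post-hoc methods like LIME and SHAP aim to produce a similar per-decision breakdown for models where no such sum exists.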

Build interpretability in from the start if you'll need it. Retrofitting is significantly harder and often produces explanations that satisfy compliance requirements on paper while telling affected users nothing meaningful.


Personalization vs. privacy

Better personalization requires more data. More data creates more privacy exposure. Every AI product with personalization features eventually has to resolve this, and how you resolve it has real consequences for user trust.

The failure on the personalization side: features that feel invasive. The "how did it know that?" reaction that should feel delightful instead feels like surveillance. Users who feel watched don't just distrust the feature. They distrust the product.

The failure on the privacy side: collecting less than you need and shipping a personalization experience so generic it provides no value. Users who see recommendations that clearly know nothing about them don't trust those either, just for different reasons.

The resolution space is larger than most teams realize. Federated learning trains models on-device and shares only model updates, never raw data. Apple's keyboard learns from what you type without that data ever leaving your phone. Differential privacy adds calibrated noise to aggregate statistics, which puts a mathematical bound on how much any one person's data can change the published result, so individuals are hard to single out even in aggregate analysis. On-device inference, covered in Post 7, keeps both the model and user data entirely local.
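For a feel of what differential privacy looks like in practice, here is a minimal sketch of the Laplace mechanism applied to a single count. It assumes each user contributes at most one to the count (sensitivity 1). The function names and the epsilon values are illustrative. A production system would also track a privacy budget across queries, which this does not.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise.

    The difference of two independent exponentials with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Return the count plus Laplace noise with scale sensitivity / epsilon.

    Smaller epsilon means more noise and stronger privacy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(42)  # seeded only to make the demo repeatable
true_count = 1_000       # e.g. users who listened to a given artist
for epsilon in (0.1, 1.0):
    noisy = private_count(true_count, epsilon, rng)
    print(f"epsilon={epsilon}: reported count = {noisy:.1f}")
```

The product decision hiding in here is epsilon: it is the same kind of dial as a classification threshold, trading accuracy of the statistic against protection for the individual.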

Start by asking what personalization your product actually needs to deliver its core value. Not what would be nice to have. What's the minimum data that produces a meaningfully better experience than no personalization at all? Then ask whether that minimum can be collected with explicit consent and transparent disclosure.

Users are substantially more willing to share data when they understand why and can see the benefit directly. Spotify users understand that listening history improves recommendations because the connection is obvious. The products that erode trust are the ones where the data collection is opaque and the benefit is diffuse.

On the regulatory side: the EU AI Act, GDPR, and a growing body of US state privacy laws have turned personalization vs. privacy from a product values question into a compliance question for any product with EU users or sensitive data categories. Build the privacy architecture before you need it. Retrofitting costs significantly more than designing it in.


Automation vs. control

The instinct when AI automation produces errors is to add more human review. The instinct when human review creates bottlenecks is to remove it. Both instincts treat automation and control as a dial where more of one means less of the other.

The best AI products reject this framing. They design for high automation and high control at the same time. Finding the product design that achieves both is the actual job.

Gmail's spam filter is fully automated. It moves millions of emails per day without a human in the loop. It is also highly controllable. One click moves an email back to the inbox. One click teaches the filter to never move that sender's mail again. The automation doesn't reduce control. It's designed so that control is easy to exercise when users want it.

GitHub Copilot suggests code without interrupting the developer's flow. High automation. The suggestion appears in gray text that requires an explicit Tab key acceptance before it affects anything. High control. The developer never has to undo an automated action because automation never takes effect without consent.

Tesla Autopilot controls the vehicle. High automation. The driver must stay ready to take over, gets audio and visual alerts if hands leave the wheel, and can override instantly. High control. Human override is always available and always fast.

The automation continuum runs from Level 0 (humans do everything) through Level 5 (fully automated, with no human involvement at all). The insight most teams miss is that a single product shouldn't be at one level across all its actions. Different actions warrant different automation levels based on stakes and reversibility.

Gmail auto-categorizes email at Level 3 (AI decides, user can override) but requires user action to send at Level 2 (AI suggests, user approves). Miscategorizing an email has low stakes and is easily reversed. Sending the wrong email has high stakes and is not. Different actions, different levels, same product.

For each action your AI can take, ask two things: what's the cost of a mistake, and how easily can it be undone? High cost with low reversibility means human approval before the action happens. Low cost with easy reversal means automation with undo. The mistake is applying the same automation level to all actions regardless of their individual stakes.
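Those two questions fit in a small per-action policy table. The sketch below encodes only the two cases spelled out above. The mixed cases (high cost but reversible, low cost but irreversible) are not prescribed here, so the code flags them for a human decision rather than guessing. All names are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    AUTOMATE_WITH_UNDO = "automate and offer undo"
    REQUIRE_APPROVAL = "require human approval first"
    DECIDE_CASE_BY_CASE = "mixed signals: decide case by case"


def automation_mode(high_cost, easily_reversed):
    """Map an action's stakes and reversibility to an automation mode."""
    if high_cost and not easily_reversed:
        return Mode.REQUIRE_APPROVAL
    if not high_cost and easily_reversed:
        return Mode.AUTOMATE_WITH_UNDO
    # Assumption: the two mixed quadrants need a judgment call per product.
    return Mode.DECIDE_CASE_BY_CASE


# action name -> (high_cost, easily_reversed), using the email example above
ACTIONS = {
    "categorize_email": (False, True),
    "send_email": (True, False),
}

for action, (high_cost, easily_reversed) in ACTIONS.items():
    print(f"{action}: {automation_mode(high_cost, easily_reversed).value}")
```

Writing the table out per action is the useful part. It makes it obvious when a whole product has been assigned one mode by default.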


Short-term metrics vs. long-term model health

This is the hardest tradeoff to see in real time and the most expensive to unwind.

The trap works like this. You have a metric that's measurable. The model gets good at improving it. The metric goes up. And something that actually matters to users degrades in a way that doesn't show up until much later.

Go back to the recommendation algorithm from the opening. Watch time went up. Content that was emotionally provocative, that generated outrage, that kept users in engagement loops they didn't consciously choose, all of it got recommended more. User satisfaction, measured separately and less frequently, declined. The gap between the metric going up and the business consequence becoming visible was long enough that the optimization ran for a year and a half.

The problem isn't that watch time is a bad metric. It's that watch time was the only metric with real teeth. The one the model was rewarded for, the one that determined resource allocation, the one in every review deck. Metrics that weren't being optimized weren't being protected.

Four mechanisms that structurally prevent this:

Guardrail metrics with actual consequences. Not metrics you mention in reviews. Metrics that trigger investigation or rollback if they move against you. A model update that improves engagement while degrading a guardrail doesn't ship without explicit sign-off from someone senior enough to own that call.

Permanent holdout groups. Keep one to five percent of users in a control group permanently, not just for the duration of an A/B test. This lets you measure the cumulative effect of optimizations over months and years. The two-week test tells you what changed. The permanent holdout tells you what compounded.

Long-horizon cohort analysis. Measure satisfaction and retention at Week 1, Week 4, Week 8, and Week 12 for every major model change. The novelty effect reliably inflates Week 1 numbers. Users engage with a new AI feature because it's new, not because it's genuinely valuable. Week 8 retention is what tells you which one you built.

Diversity constraints explicit in the model objective. If your product should surface a range of content or creators, encode that as a constraint rather than hoping engagement metrics produce it naturally. They won't. Spotify needed explicit constraints to stop recommendations from collapsing toward mainstream artists that dominated engagement signals.
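Two of these mechanisms are easy to sketch in code. The first function is a guardrail check that names any protected metric that dropped more than an allowed fraction. The second assigns users to a permanent holdout by hashing their ID, so membership stays stable across releases. Metric names, tolerances, the 2% holdout size, and the salt are all assumptions for illustration, and the guardrail check assumes higher is better for every metric.

```python
import hashlib


def guardrail_violations(baseline, candidate, max_relative_drop):
    """Return names of guardrail metrics that dropped more than allowed."""
    violations = []
    for name, allowed in max_relative_drop.items():
        drop = (baseline[name] - candidate[name]) / baseline[name]
        if drop > allowed:
            violations.append(name)
    return violations


def in_permanent_holdout(user_id, holdout_percent=2.0, salt="permanent-holdout-v1"):
    """Deterministically place roughly holdout_percent of users in the holdout."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # buckets 0..9999
    return bucket < holdout_percent * 100


baseline = {"satisfaction": 4.2, "week8_retention": 0.40}
candidate = {"satisfaction": 4.1, "week8_retention": 0.36}
tolerances = {"satisfaction": 0.05, "week8_retention": 0.05}

blocked_by = guardrail_violations(baseline, candidate, tolerances)
if blocked_by:
    print("do not ship without explicit sign-off; guardrails hit:", blocked_by)
else:
    print("guardrails clear")

print("user-123 in holdout:", in_permanent_holdout("user-123"))
```

The code is the easy part. The mechanism only works if a non-empty `blocked_by` actually stops the release.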

The principle underneath all of this: the metrics you protect are the values your product actually expresses. Every metric you track but don't protect is a value you're claiming without defending. Users notice the difference eventually, even if they can't articulate exactly what feels off.


The pattern across all six

Six tradeoffs, one underlying shape.

False positive thresholds set to library defaults. Cost models never run at realistic scale. Interpretability requirements discovered during a regulatory inquiry rather than a design review. Privacy architecture bolted on under pressure. Automation levels applied uniformly regardless of individual action stakes. Engagement metrics optimized without guardrails on what actually matters long-term.

None of these are failures of intelligence. They're failures of deliberateness. Smart people moving fast without stopping to ask what they're trading away.

The habit worth building: in every model review, every architecture discussion, every AI product decision, ask explicitly what you're optimizing for and what you're giving up to get it. The question takes thirty seconds. The answer, surfaced early, can save months.


What comes next

You now have the tradeoffs framework. The next question is where the inputs to those tradeoffs actually come from, specifically the data that makes your model work in the first place.

Data strategy is the execution decision that trips up more AI products than any architecture choice. Teams underestimate how much labeled data they need, how expensive quality labeling actually is, how to get started when you have no data yet, and how to build data pipelines that don't become the bottleneck as you scale.

In the next post, we'll cover all of it: how to source data, how to label it without breaking your budget, how to handle the cold start problem when you're launching something new, and how to build a data strategy that compounds rather than just sustains.