How to Scope and Ship an AI Feature Without Blowing Up

Part 10 of the Applied AI Product Management series. Post 9 covered data strategy. Before that, Posts 6 through 8 covered architecture, infrastructure, and tradeoffs. All of that assumed you've already validated that the feature is worth building. Most teams skip that step.
A team spent six months building a document classification model. The accuracy was good. The infrastructure was solid. The labeling pipeline worked. When they shipped it, users barely touched it. Not because the model was wrong. Because the problem it solved wasn't painful enough to change user behavior.
A two-week manual test at the start, where a person did the classification by hand, would have shown the same thing. Six months earlier. For almost no cost.
Gartner forecasts more than 80% of enterprises will have used GenAI APIs or deployed GenAI-enabled apps in production by 2026, up from under 5% in 2023. The ambition is real. The execution failure rate is equally real. Most AI features that fail don't fail because the ML didn't work. They fail because the product definition was wrong, the scope was too broad, or the team committed to building before validating that anyone wanted the output.
This post is about the craft that happens before the model gets trained.
Is this worth building at all
The first PM discipline in AI is recognizing which opportunities are genuinely worth pursuing and which ones are technically interesting but commercially thin.
The questions that separate the two aren't complex. They're just rarely asked explicitly before work starts.
Is the task repetitive enough that a pattern exists? One-off judgments that depend heavily on context and intuition are hard for models to learn because there's no consistent signal. Tasks done thousands of times per day, with similar inputs producing similar outputs, are where ML adds real value. Customer intent classification, document routing, content moderation at scale, code completion. The pattern has to be there for the model to learn it.
Can you measure success unambiguously? "Better user experience" is not a success criterion. "Reduce the time users spend manually categorizing support tickets by 40%" is. If you can't state what success looks like in a number before building, you won't know if you got there after.
Is there a simpler solution you haven't honestly tried? This was covered in Post 1 and bears repeating in the scoping context. Before committing to ML, ask whether better UX, a clearer onboarding flow, or a simple rules-based filter would close 80% of the gap. The teams that benefit most from ML are the ones who exhausted simpler options first, not the ones who jumped to ML because it felt more sophisticated.
What's the realistic error tolerance? Some products can absorb a 10% error rate without meaningful user impact. Others can't absorb 1%. Medical decisions, financial transactions, legal conclusions. Knowing your error tolerance before building determines which approach is viable. A product that needs 99.9% accuracy on a task where current models hit 92% isn't an AI opportunity yet. It's a future opportunity that needs to wait for models to improve.
If you can't answer all four honestly before engineering starts, the right call is a discovery sprint, not a development sprint. Two weeks of research, a handful of user interviews, and a manual prototype will tell you more than six months of model training on a problem that turns out not to matter.
The feasibility assessment nobody does but everyone should
Assume the opportunity is real. The next question is whether the team can actually execute it. This is where most AI product timelines go wrong, not in the ML work itself but in the failure to assess feasibility before committing.
A proper feasibility assessment takes one to two weeks and answers five questions. The answers determine whether you proceed, pivot, or stop.
Can a foundation model do this without custom training? This is the first thing to check, and most teams skip it because it feels too simple. Take 20 to 30 representative examples of the task, run them through Claude or GPT-5.5 with a well-written prompt, and evaluate the outputs. If the quality is acceptable, you don't have a model-building project. You have an integration project. That's faster, cheaper, and lower-risk. Post 6 covered this from an architecture perspective. In the scoping context, it's even more fundamental. Don't spend six months building what you can validate in two days.
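A minimal sketch of that two-day check, assuming a classification-style task with a known expected label per example. The `model_fn` argument is a placeholder for whatever API call you actually use; `fake_model` below is a hypothetical stand-in so the sketch runs without network access, and the three tickets are invented for illustration.

```python
from typing import Callable

def spot_check(examples: list[tuple[str, str]],
               model_fn: Callable[[str], str]) -> dict:
    """Run each (input, expected_label) pair through model_fn and tally results."""
    failures = []
    for text, expected in examples:
        predicted = model_fn(text).strip().lower()
        if predicted != expected.strip().lower():
            failures.append(
                {"input": text, "expected": expected, "predicted": predicted}
            )
    total = len(examples)
    correct = total - len(failures)
    return {
        "total": total,
        "correct": correct,
        "accuracy": correct / total if total else 0.0,
        "failures": failures,  # keep the raw misses; they matter more than the score
    }

# Hypothetical stand-in for a real foundation-model call (assumption, not a real API).
def fake_model(text: str) -> str:
    return "billing" if "invoice" in text.lower() else "technical"

examples = [
    ("My invoice is wrong", "billing"),
    ("App crashes on login", "technical"),
    ("I was charged twice", "billing"),
]
report = spot_check(examples, fake_model)
```

The failure list is the useful output here. With only 20 to 30 examples the accuracy figure is noisy, but reading the specific misses usually tells you whether a better prompt would fix them.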
Is there signal in the data? If a foundation model doesn't get you there, check whether your existing data contains a learnable pattern. Train a simple baseline model, the simplest one that could possibly work, and see if it beats a naive heuristic. If logistic regression trained on your existing data reaches 52% accuracy on a balanced binary task where random guessing produces 50%, the signal probably isn't there. More sophisticated models won't fix that. More data, or a different framing of the problem, might.
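One way to make the "52% versus 50%" judgment less of a gut call is to compare the model against a majority-class baseline and ask whether the gap is both larger than noise and large enough to matter. The sketch below uses a normal approximation to the binomial; the `min_lift` and `z_threshold` defaults are illustrative assumptions, not standards.

```python
import math
from collections import Counter

def majority_baseline_accuracy(labels: list[str]) -> float:
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

def signal_check(model_acc: float, baseline_acc: float, n_eval: int,
                 min_lift: float = 0.05, z_threshold: float = 1.96) -> dict:
    """Compare model accuracy to a naive baseline on n_eval held-out examples."""
    lift = model_acc - baseline_acc
    # Standard error of the baseline accuracy under a binomial model.
    se = math.sqrt(baseline_acc * (1 - baseline_acc) / n_eval)
    z = lift / se if se > 0 else 0.0
    return {
        "lift": lift,
        "z": z,
        # Require the gap to be both statistically and practically meaningful.
        "has_signal": lift >= min_lift and z >= z_threshold,
    }

labels = ["a"] * 250 + ["b"] * 250             # balanced evaluation set
baseline = majority_baseline_accuracy(labels)  # 0.5
weak = signal_check(0.52, baseline, n_eval=len(labels))
strong = signal_check(0.71, baseline, n_eval=len(labels))
```

On 500 evaluation examples, a two-point lift sits well inside the noise, while a 21-point lift clears both bars. The practical threshold is a product decision, so treat `min_lift` as something to argue about rather than a constant.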
Do you have enough labeled data, or a realistic path to it? Post 9 covered the cost and timeline reality of labeling in detail. In the feasibility context, the question is whether the data situation is compatible with the timeline. If you need 10,000 labeled examples, expert labeling costs $50,000 and takes three months, and the project timeline is eight weeks, you have a feasibility problem before a single line of model code is written.
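The arithmetic in that example is worth writing down explicitly, because it is usually done in someone's head and rounded optimistically. A small sketch, assuming a flat cost per label and a steady weekly labeling rate. The $5 per label follows from the $50,000 for 10,000 examples above; the weekly throughput and the $60,000 budget are hypothetical figures chosen to match "three months."

```python
import math
from dataclasses import dataclass

@dataclass
class LabelingPlan:
    examples_needed: int
    cost_per_label: float   # dollars
    labels_per_week: int    # assumed steady throughput

def labeling_feasibility(plan: LabelingPlan, budget: float,
                         timeline_weeks: int) -> dict:
    """Check a labeling plan against the project's budget and timeline."""
    total_cost = plan.examples_needed * plan.cost_per_label
    weeks_needed = math.ceil(plan.examples_needed / plan.labels_per_week)
    return {
        "total_cost": total_cost,
        "weeks_needed": weeks_needed,
        "fits_budget": total_cost <= budget,
        "fits_timeline": weeks_needed <= timeline_weeks,
    }

plan = LabelingPlan(examples_needed=10_000, cost_per_label=5.0, labels_per_week=770)
result = labeling_feasibility(plan, budget=60_000, timeline_weeks=8)
```

Here the money works and the calendar doesn't: about 13 weeks of labeling against an eight-week timeline. That mismatch belongs in the feasibility memo.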
What's the latency requirement, and can any viable model meet it? This matters more than most scoping conversations account for. A model that needs to respond in 80 milliseconds while the user is typing has a completely different architecture requirement than one running as a background job. Knowing this before architecture decisions get made prevents expensive retrofits later.
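"Show the benchmarks" usually means tail latency, not the average. A minimal sketch that checks a set of measured response times against a budget using a nearest-rank percentile. The sample timings are invented for illustration; the 80 ms budget comes from the example above.

```python
import math

def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def meets_latency_budget(samples_ms: list[float], budget_ms: float,
                         pct: int = 95) -> dict:
    """Compare the chosen latency percentile against the budget."""
    observed = percentile(samples_ms, pct)
    return {"percentile": pct, "observed_ms": observed,
            "budget_ms": budget_ms, "ok": observed <= budget_ms}

# Hypothetical timings from 20 benchmark requests, in milliseconds.
samples_ms = [42, 45, 47, 50, 51, 53, 55, 56, 58, 60,
              61, 63, 65, 66, 68, 70, 72, 75, 79, 140]
p95 = meets_latency_budget(samples_ms, budget_ms=80, pct=95)
p99 = meets_latency_budget(samples_ms, budget_ms=80, pct=99)
```

The same data passes at p95 and fails at p99, so the memo should state which percentile the product actually needs. For a feature that responds while the user is typing, the slow tail is what people notice.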
Can you explain failures to users when they happen? Not philosophically. Practically. If the model gets it wrong, what does the user see and what can they do? Features that fail silently erode trust fast. Features with graceful fallbacks and clear override mechanisms erode it slowly or not at all. Designing this before building means the failure handling is part of the product, not an afterthought.
The output of this assessment should be a one-page feasibility memo that answers these five questions with evidence, not opinion. Not "we think the data is probably there." Show the baseline results. Not "latency should be fine." Show the benchmarks. This memo is the gate before full investment. Teams that skip it trade a week of disciplined work for months of surprised pivots.
The MVP ladder: fake it before you build it
The most counterintuitive advice in AI product development is also the most consistently correct: don't build AI first.
The sequence that works in practice moves through stages, and the discipline is staying at each stage long enough to learn what you need before moving up.
Stage one is manual. A person does the task the AI will eventually do. Not for user-facing delivery necessarily, but for validation. Does the output, when done well by a human, change user behavior? Does the value actually materialize when the task gets done correctly? This is the question the six-month classification project from the opening story skipped. Two weeks of a team member manually doing the classification would have revealed that users didn't engage with the results regardless of quality.
Stage two is the Wizard of Oz. The user interface looks like AI is running. Behind the scenes, a human is doing the work. The user sees a polished experience. The team learns whether users engage with the output, how they want to interact with it, what the edge cases are, and what failure looks like in practice. This stage is underused because it feels deceptive. It isn't. You're not lying about the product's eventual architecture. You're validating the product experience before building the infrastructure to automate it. Grammarly's earliest corrections were reviewed by humans before any ML model touched them. The product UX was validated before the model was optimized.
Stage three is rules. If the Wizard of Oz stage confirms value, before writing model code, ask whether a rules-based approach handles the common cases well enough to ship something real. Rules are deterministic, explainable, and cheap to build. A rules-based filter that handles 70% of cases correctly and passes the remaining 30% to humans is a shipping product. It collects data that trains the model that eventually handles 90% of cases without human involvement. Skipping rules to jump straight to ML usually means building a more expensive system on less data to solve a problem you haven't fully understood yet.
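A sketch of what that stage can look like in practice, assuming a support-ticket routing task. The keywords and labels are hypothetical; the point is the shape: a rule either fires with a label or explicitly defers to a human, and both paths are recorded.

```python
from typing import Optional

# Hypothetical keyword rules; a real rule set would come from reading actual tickets.
RULES = [
    (("refund", "charged", "invoice"), "billing"),
    (("password", "login", "locked out"), "account_access"),
]

def rules_classify(ticket_text: str) -> Optional[str]:
    """Return a label when a rule fires, or None to defer to a human."""
    text = ticket_text.lower()
    for keywords, label in RULES:
        if any(keyword in text for keyword in keywords):
            return label
    return None

def route(tickets: list[str]) -> tuple[list[tuple[str, str]], list[str], float]:
    """Split tickets into auto-labeled and human-review queues."""
    auto, human_queue = [], []
    for ticket in tickets:
        label = rules_classify(ticket)
        if label is None:
            human_queue.append(ticket)     # humans label these; that becomes training data
        else:
            auto.append((ticket, label))
    coverage = len(auto) / len(tickets) if tickets else 0.0
    return auto, human_queue, coverage

tickets = [
    "I was charged twice this month",
    "Cannot reset my password",
    "The export button does nothing",
    "Please send my invoice again",
]
auto, human_queue, coverage = route(tickets)
```

Coverage only says how often a rule fired, not how often it was right. Spot-checking a sample of the auto-labeled tickets is what turns "75% covered" into "70% handled correctly."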
Stage four is simple ML. Once you have user behavior data from the rules-based stage and enough labeled examples to train a basic model, move to classical ML. Not deep learning. Not a fine-tuned LLM. Logistic regression, gradient boosting, a simple classifier. This stage validates that the ML approach can outperform the rules, and it produces a model fast enough to iterate on quickly. It also forces the team to confront the data requirements, the evaluation setup, and the serving infrastructure before those decisions are made under pressure.
Stage five is production ML. Once the simple model has proven the concept and you understand the performance ceiling you need to reach, invest in the more sophisticated approach. Fine-tuning, deeper architectures, larger models, more sophisticated retrieval. A reasonable benchmark for a well-scoped AI MVP is 8 to 12 weeks from planning to first user. Teams that reach that benchmark consistently are the ones that move through these stages rather than trying to start at stage five.
The mistake that costs the most: skipping to stage four or five because the earlier stages feel beneath the team's technical ambition. The stages aren't a signal of low ambition. They're evidence of good product judgment.
Working with ML engineers without creating friction
The PM behaviors that create friction in ML partnerships are usually well-intentioned. "Make it smarter." "Can we improve the accuracy?" "Why is it getting this wrong?" None of these are actionable. They signal that the PM doesn't understand the system well enough to give useful direction, which puts the ML engineer in the position of guessing what the PM wants while also doing the technical work.
The PM behaviors that move things forward are specific, grounded in data, and separate the product question from the technical question.
Bring failure examples, not failure summaries. "The model is getting support tickets wrong" is a summary. Showing an ML engineer 15 specific tickets the model misclassified, with your annotation of why each one should have been classified differently, gives them something to work with. The pattern in the failures tells them more than any metric. This is work the PM should do before the conversation, not during it.
Specify the tradeoff you want to make. Post 8 covered precision vs recall in depth. In the ML partnership context, the application is: when you ask for improvement, tell the engineer which dimension matters more for your users. "We're seeing too many false positives on the fraud model and users are getting frustrated with declined transactions. I'd rather miss some fraud than keep blocking legitimate users at this rate." That's a direction. "Improve the fraud model" is not.
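It helps to see what that direction means mechanically. In a score-based fraud model, trading recall for precision is often just a threshold change. The sketch below computes both at two thresholds on a small invented dataset; the scores and labels are illustrative assumptions, not real results.

```python
def precision_recall(scores: list[float], labels: list[int],
                     threshold: float) -> tuple[float, float]:
    """Precision and recall when every score >= threshold is flagged as fraud."""
    tp = fp = fn = 0
    for score, is_fraud in zip(scores, labels):
        flagged = score >= threshold
        if flagged and is_fraud:
            tp += 1            # fraud correctly blocked
        elif flagged and not is_fraud:
            fp += 1            # legitimate user blocked
        elif not flagged and is_fraud:
            fn += 1            # fraud missed
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall

# Hypothetical model scores and ground truth (1 = fraud, 0 = legitimate).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    1,    0,    0,    1,    0,    0,    0]

loose = precision_recall(scores, labels, threshold=0.5)    # (0.5, 0.75)
strict = precision_recall(scores, labels, threshold=0.85)  # (1.0, 0.5)
```

At the looser threshold half of the blocked transactions are legitimate. Raising the threshold stops blocking legitimate users in this sample at the cost of missing more fraud, which is exactly the tradeoff the PM is choosing.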
Give the quality bar, not the technical approach. The PM's job is to say "we need 90% precision before we can ship this to all users" not "have you tried adding more layers to the model." One gives the engineer freedom to find the best technical path to the product goal. The other creates unnecessary constraint around an area where PM intuition is usually less reliable than ML engineer expertise.
Ask for the confusion matrix before the aggregate metric. Aggregate accuracy hides things. A model that's 91% accurate overall but wrong on 40% of cases for a specific user segment is not a 91% accurate model for that segment. Before any model ships, the PM should have seen the confusion matrix and the disaggregated performance across the segments that matter. This is a PM responsibility, not an ML engineer responsibility. The engineer will optimize for what you tell them to optimize for. If you don't ask for segment-level analysis, you won't get it.
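The segment-level view is cheap to produce once predictions are logged with a segment tag. A minimal sketch, assuming each record carries a segment name, the predicted label, and the actual label. The segment names and counts are invented to reproduce the 91% example.

```python
from collections import defaultdict

def accuracy_by_segment(
    records: list[tuple[str, str, str]],
) -> tuple[float, dict[str, float]]:
    """records holds (segment, predicted, actual); returns overall and per-segment accuracy."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for segment, predicted, actual in records:
        totals[segment] += 1
        correct[segment] += int(predicted == actual)
    overall = sum(correct.values()) / sum(totals.values())
    per_segment = {seg: correct[seg] / totals[seg] for seg in totals}
    return overall, per_segment

# Hypothetical evaluation set: 90 enterprise records, 10 small-business records.
records = (
    [("enterprise", "a", "a")] * 85 + [("enterprise", "a", "b")] * 5
    + [("smb", "a", "a")] * 6 + [("smb", "a", "b")] * 4
)
overall, per_segment = accuracy_by_segment(records)
```

Overall accuracy comes out at 91%, while the smaller segment sits at 60%. A small segment barely moves the headline number, which is why it has to be asked for by name.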
The pushback that moves things forward sounds different from the pushback that creates conflict. "We need six more months" from an ML team is a statement that needs to be broken down rather than accepted or rejected. What changes in six months? If the answer is "accuracy goes from 82% to 87%," the PM's job is to ask whether users would notice a five-point accuracy improvement, and whether six months of learning from a shipped 82% model might get them to 87% faster than six months of pre-launch optimization. "We need six more months to collect more data" is a different answer and requires a different response. Fast feedback loops in production often collect data faster than pre-launch labeling projects, which is an argument for shipping at 82% and improving.
Translating between business and ML
The translation problem between what a business stakeholder says and what an ML engineer hears is where product requirements go to die. It happens in both directions.
Business language tends to be outcome-oriented and vague. "We want users to find what they're looking for faster." What the ML engineer hears is unclear. Faster how? By what measure? In which part of the product? For which users?
ML language tends to be metric-oriented and specific. "We improved mean reciprocal rank by 12%." What the business stakeholder hears is unclear. Does that mean users are happier? Are they finding what they need? Is the metric moving in a direction that helps the business?
The PM sits in the middle and is responsible for both translations. Incoming from the business, the job is to convert the outcome into a measurable ML objective with a specific success threshold. Outgoing to the business, the job is to connect the ML metric to the outcome it proxies for, with an honest assessment of where that proxy breaks down.
A few translations worth having in your repertoire:
"Reduce support ticket volume" translates to: build a model that correctly resolves at least 70% of Tier 1 tickets without human escalation, measured by escalation rate, with fewer than 5% of automated resolutions being incorrect so users aren't getting wrong automated responses.
"Make search better" translates to: improve the click-through rate on the first result from 34% to 45%, measured on a holdout set representative of the top 10 query categories by volume, without degrading result diversity below the current baseline.
"Personalize the experience" translates to: build a recommendation model that surfaces at least one item the user engages with in the first session, measured by session click-through rate, with a cold-start solution for users with fewer than five interactions.
Each of these translations has three parts: the measurable objective, the specific metric, and the constraint that prevents optimizing the metric at the expense of something else that matters. The constraint is the part most PM-to-ML translations skip, and it's where the unexpected consequences described in Post 8 tend to originate.
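One way to keep all three parts from getting lost is to write the translation down as a small structured spec that can be checked against results. The sketch below encodes the "make search better" example. The field names are hypothetical, and the diversity baseline of 0.62 is an invented number, since the original only says "the current baseline."

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MLObjective:
    business_goal: str
    metric: str              # what gets optimized
    target: float            # threshold that counts as success
    guardrail_metric: str    # what must not get worse
    guardrail_floor: float   # minimum acceptable guardrail value

    def is_met(self, observed: dict[str, float]) -> bool:
        """True only if the target is hit and the guardrail holds."""
        return (observed[self.metric] >= self.target
                and observed[self.guardrail_metric] >= self.guardrail_floor)

search_spec = MLObjective(
    business_goal="Make search better",
    metric="first_result_ctr",
    target=0.45,                       # up from 0.34 today
    guardrail_metric="result_diversity",
    guardrail_floor=0.62,              # hypothetical current baseline
)

# CTR target hit, but diversity dropped below baseline: not a success.
gamed = search_spec.is_met({"first_result_ctr": 0.46, "result_diversity": 0.58})
# Both conditions satisfied.
clean = search_spec.is_met({"first_result_ctr": 0.46, "result_diversity": 0.63})
```

The first result is the case the constraint exists for: the headline metric improved by sacrificing something the spec said to protect.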
Stakeholder expectations: the conversation to have before you have to
AI projects have timeline and quality expectations that don't match the reality of how ML development works, and the gap creates stakeholder problems that are almost always avoidable with an upfront conversation.
The realities worth communicating before a project starts, not after it's late:
The first version will underperform relative to the eventual version. This isn't a failure of planning. It's how ML works. The model improves as it sees more production data, as the edge cases get labeled, as the prompts get refined. Framing v1 as "the version we learn from" rather than "the finished product" changes the expectations around its quality without lowering them.
Accuracy in testing will not match accuracy in production. Post 5 covered overfitting and distributional shift in detail. In the stakeholder context, the implication is that the accuracy number in the demo is not the accuracy number users will experience at launch. Setting that expectation explicitly, with the monitoring plan that will detect when production performance drifts, turns a potential crisis into an anticipated event with a response plan.
Timeline estimates for ML work carry more uncertainty than estimates for traditional software. The reason isn't that ML engineers are less reliable. It's that model quality has diminishing returns that are hard to predict, data quality problems surface late, and the evaluation infrastructure that confirms the model is ready takes time to build. Communicating this as a structural feature of ML projects, not an excuse for a specific delay, tends to land better with stakeholders who haven't been through an ML project before.
What comes next
Scoping gets the feature defined. Shipping gets it to users. What happens after it ships determines whether it keeps getting better or quietly degrades while the team works on something else.
Post 11 covers the operational work that runs after launch: the deployment strategies that reduce risk, the monitoring frameworks that catch problems before users notice them, multi-stage pipeline observability, and the MLOps maturity model that determines how fast your team can actually improve the product once it's in production. It's the least glamorous part of AI product work. It's also the part that separates teams that build things that get better from teams that build things that get old.





