One Reasoning Call Can Cost $75,000 a Month

Part 15 of the Applied AI Product Management series. Posts 6 and 8 touched on AI economics at a conceptual level. This post does the actual math. A PM who can describe the cost structure of an AI feature is informed. A PM who can model it, stress-test it at scale, and identify where the economics break is dangerous in the best possible sense.

A team launched an AI writing assistant. Per-query cost looked manageable during development. The feature went viral on launch day. By the end of the week, the model bill was four times what the team had projected for the entire month.

Nothing had gone wrong technically. The model was working exactly as designed. The problem was that the economics had been modeled at average usage, not at the tail usage that actually showed up when real users got access to the feature. A small percentage of users sent extremely long documents. Another group hit the feature hundreds of times per day. Neither scenario had been stress-tested in the cost model before launch.

This isn't unusual. It's the norm. AI economics break in specific, predictable ways that become obvious in hindsight and invisible in the planning spreadsheet that was built on average assumptions.

This post is the planning spreadsheet done right.

Understanding what you're actually paying for

Every API call to a language model costs money in two directions: input tokens and output tokens. Input tokens are everything the model reads — your system prompt, conversation history, retrieved documents, and the user's message. Output tokens are what the model generates in response.

Pricing is quoted per million tokens. A token is roughly four characters of English text, which works out to about 750 words per thousand tokens, or roughly 750,000 words per million. One typical page of text is roughly 500 tokens. A 30-message conversation with context is often 3,000 to 8,000 tokens depending on message length and how much context you carry forward.

Output tokens cost more than input tokens on every major model. The ratio varies, but output priced at three to six times the input rate is typical. This matters for product design. A feature that asks the model to generate long responses costs significantly more per query than one that asks for short, structured outputs. If your product doesn't need long responses, constraining output length is one of the cheapest cost optimizations available.
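To make the input/output asymmetry concrete, here is a minimal sketch of the per-call arithmetic. The function name `call_cost`, the token counts, and the $3 / $15 rates are illustrative assumptions, not any provider's API.

```python
def call_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Dollar cost of one call, given token counts and per-million-token prices."""
    input_cost = input_tokens * input_price_per_m / 1_000_000
    output_cost = output_tokens * output_price_per_m / 1_000_000
    return input_cost + output_cost


# Same 2,000-token prompt, short answer vs. long answer (assumed rates: $3 in, $15 out).
short_answer = call_cost(2_000, 150, 3.00, 15.00)
long_answer = call_cost(2_000, 1_500, 3.00, 15.00)

print(f"short answer: ${short_answer:.5f}")
print(f"long answer:  ${long_answer:.5f}")
print(f"ratio:        {long_answer / short_answer:.1f}x")
```

With these assumed numbers the prompt is identical in both calls, so the whole difference in cost comes from output length.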

Here is the current pricing landscape as of May 2026, using official API rates:

Frontier tier for complex reasoning and long-horizon tasks: Claude Opus 4.7 at $5 input and $25 output per million tokens. GPT-5.4 at $2.50 input and $15 output. Gemini 3.1 Pro at $2 input and $12 output.

Mid-tier for most production workloads: Claude Sonnet 4.6 at $3 input and $15 output. GPT-5.4 mini in a similar range. Gemini 2.5 Flash at $0.15 input and $0.60 output — an exceptional price-to-performance ratio for workloads that don't require frontier reasoning.

High-volume efficiency tier for classification, routing, and simple extraction: Claude Haiku 4.5 at $1 input and $5 output. Gemini 2.5 Flash-Lite at $0.10 input and $0.40 output. GPT-5.4 nano even cheaper. DeepSeek V3.2 at $0.27 input and $1.10 output for teams with looser latency requirements or data residency needs met by self-hosting.

The spread between the cheapest and most expensive options is roughly 50 to 100 times on a per-token basis. Building the right cost architecture is not a minor optimization. It is often the difference between an AI feature with healthy margins and one that loses money at scale.

The reasoning model cost trap

Reasoning models — Claude's extended thinking mode, GPT-5.4 Pro, o3 — produce significantly better outputs on complex multi-step tasks by generating intermediate reasoning steps before answering. The quality improvement for the right tasks is real and meaningful.

The cost trap: reasoning models charge for every hidden thinking token at the output rate. A single GPT-5.4 Pro call can burn 50,000 output tokens before producing a one-paragraph answer. (Source: HeroHunt.)

That is not a typo. GPT-5.4 Pro is priced at $30 input and $180 output per million tokens, 12 times the output rate of standard GPT-5.4. If extended thinking produces 50,000 tokens of internal reasoning before generating a 300-word response, you pay for 50,000-plus tokens at the output rate. At that pricing, the thinking tokens alone come to $9 for one query (50,000 × $180 / 1,000,000). (Source: Mordor Intelligence.)
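The $9 figure can be checked with a short sketch. The function name `reasoning_call_cost`, the 1,000-token prompt, and the 400-token visible answer are assumptions added for illustration; the prices and the 50,000 thinking tokens are the ones quoted above.

```python
def reasoning_call_cost(input_tokens, thinking_tokens, answer_tokens,
                        input_price_per_m, output_price_per_m):
    """Cost of one reasoning call: hidden thinking tokens bill at the output rate."""
    billed_output = thinking_tokens + answer_tokens
    total = input_tokens * input_price_per_m + billed_output * output_price_per_m
    return total / 1_000_000


# Thinking tokens alone, at $180 per million output tokens.
thinking_only = reasoning_call_cost(0, 50_000, 0, 30.00, 180.00)

# Same call with an assumed 1,000-token prompt and a 400-token visible answer.
full_call = reasoning_call_cost(1_000, 50_000, 400, 30.00, 180.00)

print(f"thinking only: ${thinking_only:.2f}")
print(f"full call:     ${full_call:.3f}")
```

The visible answer and the prompt barely move the total. Nearly all of the bill is reasoning the user never sees.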

Reasoning models make sense for tasks where the quality difference justifies the premium. Legal document review. Complex financial analysis. Multi-step debugging. Architectural decisions. These are tasks where users would pay significantly more for a materially better output, and where the query volume is low enough that the per-query cost is manageable.

They do not make sense for high-volume conversational features, simple classification, content generation at scale, or any task where the quality difference between standard and reasoning models is not visible to users. Using a reasoning model for customer support routing is like hiring a tax attorney to answer general customer questions. Technically capable. Economically indefensible.

The practical rule: treat reasoning model tokens as a separate budget item with explicit justification. Before any feature uses extended thinking or reasoning mode, the question to answer is: what specific quality improvement does the reasoning provide, and is that improvement worth the 10 to 50 times cost premium over standard generation?

Three worked examples

Example 1: Customer support chatbot

Product: AI-powered customer support for a SaaS product. Users ask questions, the model retrieves relevant help articles and generates responses.

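Assumptions per conversation: System prompt of 500 tokens. Retrieved help articles of 1,500 tokens. Three conversation turns averaging 200 tokens each of user input and 300 tokens each of model output. Total: 3,200 input tokens and 900 output tokens per conversation. The input figure is 2,600 tokens of system prompt, articles, and user messages, plus about 600 tokens of earlier model replies carried forward as context.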

At Claude Sonnet 4.6 ($3/$15): input cost of $0.0096, output cost of $0.0135. Total cost per conversation: $0.023, roughly 2.3 cents.

At 50,000 conversations per month: $1,150 per month in model costs.

At 500,000 conversations per month: $11,500 per month.

At 5,000,000 conversations per month: $115,000 per month.
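The scaling above is simple enough to reproduce in a few lines. This is a sketch using the token counts and Sonnet prices from this example; the variable names are illustrative. Note that it uses the unrounded per-conversation cost of $0.0231, so its monthly totals come out slightly higher than the figures above, which round to $0.023 first.

```python
INPUT_PRICE_PER_M = 3.00    # $ per million input tokens (Claude Sonnet 4.6, as quoted above)
OUTPUT_PRICE_PER_M = 15.00  # $ per million output tokens

INPUT_TOKENS = 3_200        # per conversation, from the assumptions above
OUTPUT_TOKENS = 900

cost_per_conversation = (
    INPUT_TOKENS * INPUT_PRICE_PER_M + OUTPUT_TOKENS * OUTPUT_PRICE_PER_M
) / 1_000_000

# Monthly model cost at each volume tier.
monthly_cost = {
    volume: volume * cost_per_conversation
    for volume in (50_000, 500_000, 5_000_000)
}

for volume, cost in monthly_cost.items():
    print(f"{volume:>9,} conversations: ${cost:,.0f} per month")
```

Cost scales linearly with volume, so the number to stress-test is the per-conversation figure: any error in it is multiplied by every conversation.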

Now model the same feature with tiered routing. Simple queries routed to Gemini 2.5 Flash-Lite ($0.10/$0.40). Assume 70% of conversations are simple enough for the cheaper model, 30% require the mid-tier.

Blended cost per conversation: 0.7 times $0.0006 (the same conversation at Flash-Lite rates, rounded down from $0.00068) plus 0.3 times $0.023 equals $0.0073, roughly 0.73 cents.

At 5,000,000 conversations per month: $36,500 versus $115,000. That's $78,500 per month in savings, or $942,000 annually, for the same product quality on the 70% of conversations that don't need the mid-tier model.
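Because the 70/30 split is an assumption, it is worth seeing how the blended cost moves as the split changes. The sketch below uses the prices and token counts from this example with unrounded per-conversation costs, so it lands near $0.0074 rather than $0.0073. The function names are illustrative, and the cost of the routing classifier itself is deliberately left out.

```python
def conversation_cost(input_price_per_m, output_price_per_m,
                      input_tokens=3_200, output_tokens=900):
    """Cost of one support conversation at the given per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


cheap_cost = conversation_cost(0.10, 0.40)     # Gemini 2.5 Flash-Lite
premium_cost = conversation_cost(3.00, 15.00)  # Claude Sonnet 4.6


def blended_cost(cheap_share):
    """Average cost per conversation when `cheap_share` goes to the cheap tier."""
    return cheap_share * cheap_cost + (1 - cheap_share) * premium_cost


MONTHLY_VOLUME = 5_000_000
for share in (0.5, 0.7, 0.9):
    monthly = blended_cost(share) * MONTHLY_VOLUME
    print(f"{share:.0%} routed cheap: ${monthly:,.0f} per month")
```

If the classifier sends only half of traffic to the cheap tier instead of 70%, the savings shrink accordingly, which is why the routing split deserves its own measurement before launch.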

Example 2: Document analysis feature

Product: An AI feature that analyzes uploaded contracts and extracts key terms. Documents average 7,500 tokens, sent with a 1,000-token system prompt and extraction instructions. The model returns structured JSON with 20 fields averaging 600 output tokens.

Per document: 8,500 input tokens and 600 output tokens.

At Claude Sonnet 4.6: $0.0255 input plus $0.009 output equals $0.0345 per document.

At 10,000 documents per month: $345.

At 100,000 documents: $3,450.

At 1,000,000 documents: $34,500.

Now apply prompt caching. The system prompt and extraction instructions are identical across every document; only the document content changes. Claude Opus 4.6 cached reads are $0.50 per million tokens versus $5 for standard input, a 90% reduction on cached input. Claude Sonnet 4.6 caching applies the same 90% reduction: cache reads cost $0.30 per million tokens versus $3 standard, while the initial cache write is billed at a premium over the standard input rate. (Source: o-mega.)

With the 1,000 tokens of system prompt and instructions cached, and only the 7,500-token document processed as standard input, the effective input cost drops from 8,500 standard tokens to 7,500 standard tokens plus 1,000 cached tokens. That is $0.0225 plus $0.0003, or $0.0228 of input per document instead of $0.0255, which takes the total from $0.0345 to about $0.0318 per document. The saving is roughly 8% here and grows as the shared prompt gets longer relative to the document.
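A small sketch of the caching arithmetic, under two stated assumptions: cached reads bill at a 90% discount to the standard input rate (the reduction described above), and the one-time cache write cost is ignored. The function name and the `cache_read_discount` parameter are illustrative, not a provider API.

```python
def input_cost_with_cache(cached_tokens, fresh_tokens, input_price_per_m,
                          cache_read_discount=0.90):
    """Input cost per call when part of the prompt is served from cache.

    Assumption: cached reads bill at (1 - cache_read_discount) of the standard
    input price. Cache write charges are not modeled.
    """
    cached_price = input_price_per_m * (1 - cache_read_discount)
    return (cached_tokens * cached_price
            + fresh_tokens * input_price_per_m) / 1_000_000


SONNET_INPUT_PRICE = 3.00  # $ per million input tokens, as quoted above

uncached = input_cost_with_cache(0, 8_500, SONNET_INPUT_PRICE)
cached = input_cost_with_cache(1_000, 7_500, SONNET_INPUT_PRICE)
saving_per_document = uncached - cached

print(f"uncached input: ${uncached:.4f}")
print(f"cached input:   ${cached:.4f}")
print(f"saving at 1,000,000 documents: ${saving_per_document * 1_000_000:,.0f}")
```

Rerunning this with a longer cached prefix, say 5,000 tokens of instructions and few-shot examples, shows why caching pays off most when the shared prompt is a large share of the input.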

The broader lesson: when your system prompt is long and reused across many calls, caching it is one of the highest-return optimizations available. Most teams with high system prompt costs haven't implemented caching. It takes one engineering sprint. The payback is immediate.

Example 3: Agentic workflow

Product: An AI agent that researches a topic, drafts a report, and creates a structured summary. Multiple tool calls, web searches, and generation steps.

Agentic workflows compound costs in ways single-turn features don't. Each tool call adds input tokens when results are returned to the model. Long conversations maintain growing context windows. Multi-step reasoning can generate substantial intermediate output.

A rough model: five web search tool calls returning 800 tokens each, plus one 3,000-token final generation, plus 2,000 tokens of accumulated conversation context. Total: roughly 7,000 input tokens (4,000 of search results, 2,000 of context, and an allowance of about 1,000 for the system prompt and tool definitions) and 3,000 output tokens per workflow execution.

At Claude Opus 4.7 ($5/$25): $0.035 input plus $0.075 output equals $0.11 per workflow execution.

At 10,000 workflow executions per month: $1,100.

At 100,000 executions: $11,000.

This sounds manageable. Now consider what happens when the workflow includes a reasoning step. If extended thinking generates 30,000 thinking tokens at the output rate of $25 per million: that single reasoning step adds $0.75 per execution. At 100,000 executions: $75,000 per month from one reasoning call per workflow.

Agentic cost modeling requires analyzing each step individually, not treating the workflow as a single unit. The step that uses reasoning mode is the line item that dominates the budget. This is why agentic features need per-step cost modeling before launch, not aggregate cost estimates.
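Per-step modeling can be as simple as a table of steps with token counts. The sketch below splits the workflow from this example into three line items at the Opus prices quoted above; the step names and the way the 7,000 input tokens are grouped into one step are assumptions made for illustration.

```python
INPUT_PRICE_PER_M = 5.00    # Claude Opus 4.7, $ per million input tokens
OUTPUT_PRICE_PER_M = 25.00  # $ per million output tokens

# (input_tokens, output_tokens) per step. Thinking tokens count as output.
steps = {
    "search_and_context": (7_000, 0),
    "final_generation": (0, 3_000),
    "reasoning_step": (0, 30_000),
}

step_costs = {
    name: (inp * INPUT_PRICE_PER_M + out * OUTPUT_PRICE_PER_M) / 1_000_000
    for name, (inp, out) in steps.items()
}

total_per_execution = sum(step_costs.values())
dominant_step = max(step_costs, key=step_costs.get)

for name, cost in step_costs.items():
    share = cost / total_per_execution
    print(f"{name:<20} ${cost:.3f}  ({share:.0%} of total)")
print(f"dominant step: {dominant_step}")
print(f"monthly at 100,000 executions: ${total_per_execution * 100_000:,.0f}")
```

Laid out this way, the reasoning step is visibly most of the bill, which an aggregate "cost per workflow" number would hide.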

Optimization levers in priority order

These are the optimizations that have the highest return relative to implementation cost, ordered by impact.

Tiered model routing comes first. Classifying queries and routing simple ones to cheaper models is the single highest-impact optimization for most products. A lightweight classifier — which can itself run on a very cheap model — determines which tier each query should use. The implementation cost is one sprint. The cost reduction at scale is often 50 to 80 percent.

Prompt caching comes second for features with long, repeated system prompts. If your system prompt exceeds 1,000 tokens and is reused across many calls, caching reduces the effective input cost dramatically. Every major provider now supports caching on flagship models.

Output token constraints come third. Set explicit maximum output token limits. Users rarely need responses longer than a few hundred tokens for most use cases. A feature that defaults to unlimited output will occasionally generate 2,000-word responses to simple questions, burning output budget unnecessarily.

Batch processing comes fourth for any non-real-time workload. Both Anthropic and Google offer batch APIs at a 50% discount for asynchronous processing. Document analysis, content moderation pipelines, and data extraction at scale should run through batch APIs, not real-time APIs.

Context window management comes fifth. Conversation history that grows indefinitely is a cost that compounds with every turn. Summarizing history after a set number of turns, as covered in Post 4, keeps the effective context window bounded even for long-running conversations.
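To see how quickly unbounded history compounds, here is a rough sketch that counts total input tokens across a conversation, with and without a cap on carried history. The token sizes reuse the support chatbot assumptions (500-token system prompt, 200-token user messages, 300-token replies). The 1,500-token cap stands in for a summarized history and is a hypothetical value; the cost of producing the summary is not modeled.

```python
def cumulative_input_tokens(turns, system=500, user=200, reply=300, history_cap=None):
    """Total input tokens billed across a conversation.

    Each turn resends the system prompt, the carried history, and the new user
    message. With `history_cap` set, carried history never exceeds that many
    tokens (a stand-in for summarization).
    """
    total = 0
    history = 0
    for _ in range(turns):
        carried = history if history_cap is None else min(history, history_cap)
        total += system + carried + user
        history += user + reply  # this turn's exchange joins the history
    return total


unbounded = cumulative_input_tokens(30)
capped = cumulative_input_tokens(30, history_cap=1_500)

print(f"30 turns, full history:   {unbounded:,} input tokens")
print(f"30 turns, capped history: {capped:,} input tokens")
```

Under these assumptions, a 30-turn conversation bills several times more input with full history than with a bounded one, and the gap widens with every additional turn.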

The unit economics framework

Cost modeling doesn't live in isolation. It connects to revenue and user value through unit economics — the relationship between what a feature costs to deliver and what value it generates per user.

The framework has four inputs: cost per query (from the modeling above), queries per active user per month (from usage analytics or estimates), revenue or cost savings per active user per month (what the feature is worth), and the margin target (what percentage of revenue can go to model costs).

A feature that generates $10 in monthly subscription revenue per active user, with a 20% model cost tolerance, can support up to $2 per user per month in model costs. If active users average 50 queries per month, that's $0.04 maximum per query. That constraint either validates the architecture (if the current model cost is under $0.04 per query) or tells you the architecture needs to change before the feature can be economically sustainable at scale.
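The per-query ceiling is a one-line calculation, sketched here with the numbers from this paragraph. The function name is illustrative, and the 200-queries-per-month heavy user and the $0.0231 current cost (the support chatbot figure from Example 1) are assumptions added to show how the check behaves at the tail.

```python
def max_cost_per_query(revenue_per_user, model_cost_share, queries_per_user):
    """Highest per-query model cost that stays within the margin target."""
    return revenue_per_user * model_cost_share / queries_per_user


# Average user: $10 per month revenue, 20% model cost tolerance, 50 queries per month.
average_ceiling = max_cost_per_query(10.00, 0.20, 50)

# Hypothetical heavy user on the same plan sending 200 queries per month.
heavy_ceiling = max_cost_per_query(10.00, 0.20, 200)

current_cost = 0.0231  # assumed: per-conversation cost from Example 1

print(f"average user ceiling: ${average_ceiling:.3f}, fits: {current_cost <= average_ceiling}")
print(f"heavy user ceiling:   ${heavy_ceiling:.3f}, fits: {current_cost <= heavy_ceiling}")
```

The same architecture can pass at average usage and fail for heavy users, which is the tail problem from the opening story expressed as a unit economics check.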

This calculation is simple. Most teams don't run it before launch. They run it after the first bill arrives and discover the feature is profitable at 10,000 users and underwater at 100,000 users.

The break-even analysis extends this to the scale question: at what user volume does the feature become profitable accounting for both model costs and the fixed costs of building and maintaining it? If the feature cost $200,000 to build, generates $3 per month in net revenue per active user after model costs, and requires $10,000 per month in fixed infrastructure to run, the monthly break-even is roughly 3,300 active users ($10,000 divided by $3). Below that, the feature loses money every month. Above it, the margin improves with scale and the surplus starts paying back the build cost: at 10,000 active users the feature clears $20,000 per month, which recovers the $200,000 in ten months. Knowing these numbers before launch tells you whether the feature is economically viable at your realistic growth projections.
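Both halves of the break-even question, covering monthly fixed costs and paying back the build, fit in a short sketch. The function names are illustrative, and the 10,000-user and 3,000-user scenarios are assumptions added to show the payback calculation.

```python
import math


def break_even_users(fixed_monthly_cost, net_revenue_per_user):
    """Smallest number of active users whose net revenue covers fixed monthly cost."""
    return math.ceil(fixed_monthly_cost / net_revenue_per_user)


def payback_months(build_cost, active_users, net_revenue_per_user, fixed_monthly_cost):
    """Months to recover the build cost, or None if the feature never clears fixed costs."""
    monthly_surplus = active_users * net_revenue_per_user - fixed_monthly_cost
    if monthly_surplus <= 0:
        return None
    return build_cost / monthly_surplus


users_needed = break_even_users(10_000, 3)
payback_at_10k = payback_months(200_000, 10_000, 3, 10_000)
payback_at_3k = payback_months(200_000, 3_000, 3, 10_000)

print(f"monthly break-even: {users_needed:,} active users")
print(f"payback at 10,000 users: {payback_at_10k:.0f} months")
print(f"payback at 3,000 users: {payback_at_3k}")
```

Running the payback function against a realistic growth curve, rather than a single user count, shows how long the feature stays underwater before it starts returning the build investment.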

The cost model is a product decision

The economics of an AI feature are not an engineering concern that PMs hand off after scoping. They're a product decision that determines which architecture is viable, which model tier is appropriate, which optimizations are necessary before launch, and whether the feature is worth building at all given realistic scale projections.

A PM who can hand an engineer a cost model with specific per-query targets, tiering logic, and break-even assumptions is giving the team a design constraint that produces a sustainable architecture. A PM who says "make it cost-effective" is leaving a critical product decision to whoever writes the first line of infrastructure code.

The models above are not complex. They require a spreadsheet and an hour. Run them before committing to architecture, not after the first month's bill arrives.

What comes next

You now have the financial foundation — cost modeling, unit economics, and the optimization levers that make AI features sustainable at scale.

The next post shifts from how much AI products cost to what they've produced: six case studies of products that made consequential AI decisions well, analyzed through every framework this series has built. By this point in the series, you have the mental models to read these decisions as a PM would — not as historical trivia, but as a demonstration of the product thinking that separates category-defining products from expensive experiments.

The Economics of AI Products: Cost Modeling Every PM Should Be Able to Run