Designing AI Agents: How Much Autonomy Is Actually Safe to Ship?

Senior Product Manager writing about two sides of AI: building AI products that work at scale, and using AI to work more effectively as a PM. I share frameworks for Applied AI product management—economics, evaluation, agent design, responsible deployment—alongside practical guides for AI-powered productivity, workflows, and decision-making. If you're building AI products or figuring out how to leverage AI in your PM workflow (or both), this is for you. Currently based in Seattle.

Part 19 of the Applied AI Product Management series. This post covers the product category where every decision covered in this series becomes more consequential: AI agents. Not assistants that respond to questions, but systems that plan, use tools, and take actions in the world on behalf of users.


A team built a research agent that gathered competitive intelligence. In demos it was impressive — it searched the web, extracted relevant information, synthesized findings, and produced structured reports. The demo success rate was excellent.

They shipped it. Within two weeks, the agent was occasionally sending requests to competitor websites with obvious patterns that revealed the company's research interests. It sometimes got stuck in retrieval loops that consumed significant compute before timing out. When it hallucinated a fact and embedded it in a report, the error wasn't obvious because it appeared alongside correctly researched information.

None of these failures would have happened with a simpler assistant that answered questions on request. The agent's autonomy — the feature that made it impressive — was also what made its failures different in kind, not just degree. An assistant that answers a question incorrectly can be corrected. An agent that takes a sequence of actions based on an incorrect assumption has already changed the state of the world before anyone noticed.

This is the central challenge of agentic product design. Autonomy is the feature. Autonomy is also what makes failures more consequential, harder to catch, and more complex to recover from.


The compound AI shift

For three years, the dominant mental model for AI products was a simple one: user inputs something, model outputs something. Better model, better output. The product decisions were about which model, what prompts, and how to display the result.

That mental model is insufficient for understanding what most production AI systems actually are in 2026. Stop judging the model in isolation: most gains come from orchestration. The system that delivers value to users is not a model. It's a composition of models, retrieval systems, tools, memory layers, validation steps, guardrails, and human oversight mechanisms. The model is one component.

This shift — from single model to compound system — changes what PMs need to understand. The question isn't "which model should we use?" It's "how do we design the system around the model so that it's reliable, safe, and improvable?" Context engineering, covered briefly in Post 4 as the evolution of prompt engineering, is the practice of designing that system deliberately. What goes into the context window — retrieved documents, tool outputs, conversation history, memory summaries, system instructions — determines what the model can reason over and therefore what it can reliably do.

The best agentic products are the ones where context engineering is as deliberate as model selection. The teams that treat context as an afterthought and spend all their attention on model choice are optimizing the wrong variable.


The reliability math that changes everything

Before designing an agent, run this calculation.

A single-step task completed at 95% reliability is excellent. Users encounter a failure 1 in 20 times. Most products can absorb that.

A 5-step agentic task where each step is 95% reliable has an end-to-end success rate of 77%. Nearly one in four attempts fails.

A 10-step agent at 95% per-step reliability has an end-to-end success rate of 60%. Four out of ten attempts fail before completing.

A 20-step agent at 95% per-step reliability: 36% success. Nearly two-thirds of attempts fail.
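The figures above come from multiplying the per-step success probabilities together. A minimal Python sketch, assuming every step succeeds or fails independently of the others (a simplification, since real failures are often correlated); `end_to_end_success` is an illustrative helper name, not part of any library:

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that all steps succeed, assuming independent steps."""
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


for steps in (1, 5, 10, 20):
    rate = end_to_end_success(0.95, steps)
    print(f"{steps:>2} steps at 95% per step: {rate:.0%} end to end")
```

Running it reproduces the 95%, 77%, 60%, and 36% figures quoted above.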

This is why the narrower the scope, the more reliable the agent. An agent that does one thing extremely well is almost always more valuable than an agent that does ten things adequately. The instinct to build a general-purpose agent that can handle anything runs directly into this compound probability problem, which is why general-purpose agents tend to be unreliable in practice.

The PM implication: define agent scope by the number of steps required, not by the breadth of tasks handled. A customer support agent that classifies an incoming ticket, retrieves the relevant policy, generates a draft response, and formats it for review is four steps. That's a good agent scope. A customer support agent that classifies the ticket, decides whether to escalate, looks up account history, applies discount logic, generates a personalized response, checks against compliance rules, formats for the channel, and updates the CRM is eight steps. At 95% per-step reliability, the four-step agent completes about 81% of tasks end to end; the eight-step agent completes about 66%. Each additional step multiplies the reliability cost.

Start with the minimum step count that delivers the core value. Add steps when the simpler version has proven reliable in production, not before.
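One way to turn this into a planning number is to ask how many steps a target end-to-end success rate can afford. The sketch below assumes independent steps with equal reliability; `max_steps` is a hypothetical helper, and the 80% target is chosen only for illustration:

```python
def max_steps(per_step: float, target: float) -> int:
    """Largest step count whose end-to-end success rate stays at or above target.

    Assumes independent steps that all share the same reliability.
    """
    if not 0.0 < per_step < 1.0:
        raise ValueError("per_step must be strictly between 0 and 1")
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    steps = 0
    # Keep adding steps while the next one still meets the target.
    while per_step ** (steps + 1) >= target:
        steps += 1
    return steps


print(max_steps(0.95, 0.80))  # step budget at 95% per-step reliability
print(max_steps(0.99, 0.80))  # step budget at 99% per-step reliability
```

Under these assumptions, raising per-step reliability from 95% to 99% moves the budget from 4 steps to 22, which is why improving individual steps usually comes before adding new ones.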


When agents beat assistants

Agents are not always better than assistants. They're better for a specific set of task characteristics, and using them outside those characteristics creates the complexity and reliability problems above without the corresponding value.

Use an agent when three conditions are true simultaneously.

The task is multi-step and the correct sequence of steps depends on intermediate results. If you can specify the exact steps in advance regardless of what happens, it's a workflow, not an agent. If the agent needs to decide what to do next based on what it just found, that's the reasoning capability that agents add.

The task involves tools or external systems that need to be called. Searching the web, querying a database, sending a message, updating a record — any task that requires actually doing something in the world rather than just generating text benefits from the tool-use architecture agents are built for.

The user genuinely wants to delegate an outcome rather than ask a question. "Research this topic and give me a report" is a delegation. "What are the key findings in this field?" is a question. The first wants an agent. The second wants an assistant. Teams get this wrong in either direction: building an agent for questions (adding unnecessary complexity) or building an assistant for delegations (producing outputs the user then has to manually act on).

Use an assistant — not an agent — when the task is a single step, when the user wants to stay in control of each decision, when the output needs human review before any action is taken, or when reliability requirements are higher than what a multi-step agent can deliver. Most product features are better served by an excellent assistant than a fragile agent. The glamour of agents has led many teams to over-build when a well-crafted assistant would have served users better and been easier to maintain.
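The conditions above can be written down as a small decision helper. This is a sketch of the reasoning in this section, not a standard taxonomy; `TaskProfile`, its field names, and `recommend` are all hypothetical:

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    multi_step: bool                   # the task needs more than one step
    steps_depend_on_results: bool      # the next step is chosen from intermediate results
    needs_tools: bool                  # the task must call tools or external systems
    user_delegates_outcome: bool       # the user wants an outcome, not an answer
    review_before_any_action: bool = False  # output needs human review before any action


def recommend(task: TaskProfile) -> str:
    """Return 'assistant', 'workflow', or 'agent' for a task profile."""
    if task.review_before_any_action or not task.multi_step:
        return "assistant"
    if not task.steps_depend_on_results:
        # The exact steps can be specified in advance: a workflow, not an agent.
        return "workflow"
    if task.needs_tools and task.user_delegates_outcome:
        return "agent"
    return "assistant"


research_report = TaskProfile(True, True, True, True)
field_question = TaskProfile(False, False, False, False)
print(recommend(research_report), recommend(field_question))
```

The helper deliberately defaults to "assistant" whenever any of the three agent conditions is missing, matching the argument that most features are better served by an assistant.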


MCP: the protocol that connects agents to the world

Every agent needs to do things: read files, query databases, call APIs, send messages, execute code. Before the Model Context Protocol (MCP), each of those connections required a bespoke integration written specifically for each model and each tool. Ten tools for five models meant fifty custom integrations. Each one needed to be maintained separately. Switching models meant rebuilding integrations.

MCP is an open protocol that standardizes how agents discover and invoke tools. Think of it as USB-C for AI tools: connect once, and it works everywhere that speaks MCP. Over 200 server implementations exist: GitHub, Slack, Google Drive, PostgreSQL, Notion, Jira, Salesforce, and more.

The product implications are significant and immediate. An agent built on MCP can connect to any MCP-compliant tool without custom integration work. When you switch from one model to another, your tool connections stay intact. When a new tool releases an MCP server, your agent can use it without any changes on your side.
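The integration arithmetic behind the fifty-integration figure can be sketched as follows. The additive count is an idealization that assumes each tool ships exactly one MCP server and each model-side application ships exactly one MCP client; both function names are illustrative:

```python
def bespoke_integrations(tools: int, models: int) -> int:
    """Every model needs its own integration with every tool."""
    return tools * models


def shared_protocol_integrations(tools: int, models: int) -> int:
    """One server per tool plus one client per model (idealized assumption)."""
    return tools + models


print(bespoke_integrations(10, 5))          # the fifty custom integrations described above
print(shared_protocol_integrations(10, 5))  # servers plus clients under a shared protocol
```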

For PMs the most important implications are strategic, not technical. In an agentic world, tool descriptions are the new interface. The model never sees your UI. The only thing it reads to understand what a tool does is the name, description, and parameter schema. This means PMs need to own tool descriptions with the same care applied to copy and UX. A poorly written tool description produces an agent that misuses the tool. A well-written one produces an agent that uses it precisely.
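To make the point about tool descriptions concrete, here are two definitions of the same hypothetical tool, written as Python dictionaries. MCP tool definitions carry a `name`, a `description`, and an `inputSchema` expressed as JSON Schema; the tool itself (`search_orders`), its parameters, and the `description_issues` checks are invented for illustration and are only a rough heuristic:

```python
vague_tool = {
    "name": "search",
    "description": "Searches stuff.",
    "inputSchema": {"type": "object", "properties": {"q": {"type": "string"}}},
}

precise_tool = {
    "name": "search_orders",
    "description": (
        "Look up a customer's orders by email address. Read-only. "
        "Returns at most `limit` orders, newest first. "
        "Use for order-status questions; do not use for refunds."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "customer_email": {"type": "string", "description": "Email address on the account."},
            "limit": {"type": "integer", "minimum": 1, "maximum": 50,
                      "description": "Maximum number of orders to return."},
        },
        "required": ["customer_email"],
    },
}


def description_issues(tool: dict, min_words: int = 8) -> list:
    """Flag descriptions a model would struggle to act on (hypothetical heuristic)."""
    issues = []
    if len(tool.get("description", "").split()) < min_words:
        issues.append("description is too short to explain when to use the tool")
    for name, schema in tool.get("inputSchema", {}).get("properties", {}).items():
        if "description" not in schema:
            issues.append(f"parameter '{name}' has no description")
    return issues


print(description_issues(vague_tool))
print(description_issues(precise_tool))
```

A check like this could run in review alongside copy edits, treating the description as product surface rather than developer documentation.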

The second strategic implication: should your product expose an MCP server? If your product contains data or capabilities that users of AI agents would want to access — a CRM, a project management tool, a knowledge base, a data warehouse — exposing an MCP server makes your product usable by every MCP-compatible agent in the ecosystem. SaaS products that become agent-ready in 2026 will win the enterprise deals that require AI interoperability. The ones that don't will spend 2027 explaining to their board why they're losing RFPs. The decision of whether to build an MCP server is a product strategy decision, not an engineering decision.

The security considerations: Post 14 covered MCP-specific attack vectors in the red teaming context. In the architecture context, the mitigation principles are: scope tool permissions to the minimum required, require explicit authorization for write operations, validate all tool inputs server-side rather than trusting the model to submit correct parameters, and maintain an audit log of every tool invocation. Agents with broad tool permissions and no audit trail are a security and governance liability regardless of how well they perform in demos.
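The four mitigation principles can be sketched as a single gateway that every tool call passes through. This is an illustrative design, not an MCP API: `ToolGateway`, its parameters, and the in-memory audit log are assumptions, and a production system would persist the log and authenticate the approver:

```python
import time
from typing import Any, Callable


class ToolGateway:
    """Hypothetical server-side wrapper enforcing scope, write authorization,
    input validation, and audit logging for every tool invocation."""

    def __init__(self, allowed_tools: set, write_tools: set) -> None:
        self.allowed_tools = allowed_tools   # minimum set of tools this agent may call
        self.write_tools = write_tools       # tools that change state
        self.audit_log: list = []            # one entry per invocation, allowed or denied

    def call(self, tool: str, params: dict,
             handler: Callable[[dict], Any],
             validator: Callable[[dict], bool],
             write_authorized: bool = False) -> Any:
        decision = "allowed"
        if tool not in self.allowed_tools:
            decision = "denied: tool not in scope"
        elif tool in self.write_tools and not write_authorized:
            decision = "denied: write not authorized"
        elif not validator(params):
            # Validate on the server; never trust the model's parameters.
            decision = "denied: invalid parameters"
        self.audit_log.append(
            {"ts": time.time(), "tool": tool, "params": params, "decision": decision}
        )
        if decision != "allowed":
            raise PermissionError(decision)
        return handler(params)


gateway = ToolGateway(allowed_tools={"read_record", "update_record"},
                      write_tools={"update_record"})
result = gateway.call("read_record", {"id": 7},
                      handler=lambda p: {"id": p["id"], "status": "open"},
                      validator=lambda p: isinstance(p.get("id"), int))
print(result, len(gateway.audit_log))
```

Denied calls are logged before the exception is raised, so the audit trail covers attempts as well as successes.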


A2A: the protocol that connects agents to each other

MCP handles how one agent talks to tools. It doesn't handle how two agents talk to each other. That's what A2A, the Agent2Agent protocol, addresses.

MCP's client-server model gives you reliable, auditable, typed tool interfaces. A2A's peer-to-peer discovery model gives you dynamic capability negotiation between agents that may not know about each other at design time. These are fundamentally different needs.

The practical scenario where A2A matters: a general-purpose research agent needs specialized legal analysis. Rather than the research agent having all legal knowledge itself, it can discover and delegate to a specialized legal agent via A2A. The legal agent has its own tools, its own context, and its own guardrails. The research agent doesn't need to know how legal analysis works. It just needs to know that a legal agent exists, what it can do, and how to ask it.
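The discover-then-delegate pattern in this scenario can be sketched with an in-memory registry. This is not the A2A wire format or any SDK: it only borrows the idea of an agent advertising its capabilities, and `AgentCard`, `AgentRegistry`, `delegate`, and the `legal_analysis` skill name are hypothetical. The `handle` callable stands in for what would be a network call:

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AgentCard:
    name: str
    skills: frozenset               # capabilities the agent advertises
    handle: Callable[[str], str]    # stand-in for a remote request to that agent


class AgentRegistry:
    def __init__(self) -> None:
        self._cards: list = []

    def register(self, card: AgentCard) -> None:
        self._cards.append(card)

    def discover(self, skill: str) -> Optional[AgentCard]:
        """Return the first agent advertising the skill, or None."""
        for card in self._cards:
            if skill in card.skills:
                return card
        return None


def delegate(registry: AgentRegistry, skill: str, request: str) -> str:
    """Find an agent by advertised skill and hand the request to it."""
    card = registry.discover(skill)
    if card is None:
        raise LookupError(f"no agent advertises skill '{skill}'")
    return card.handle(request)


registry = AgentRegistry()
registry.register(AgentCard(
    name="legal-agent",
    skills=frozenset({"legal_analysis"}),
    handle=lambda req: f"[legal-agent] analysis of: {req}",
))
print(delegate(registry, "legal_analysis", "non-compete clause in a vendor contract"))
```

The research agent only depends on the advertised skill and the request format, which is the separation of concerns the scenario describes.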

The agent interoperability picture in 2026 breaks into three layers.

Layer 1, Tool Integration: how a single agent connects to external capabilities. This is MCP's domain.

Layer 2, Agent Coordination: how agents discover each other and exchange results across organizational boundaries. This is A2A's domain.

Layer 3, Identity and Trust: how agents verify each other.

For most products building their first agent, A2A is not an immediate requirement. It becomes relevant when multiple specialized agents need to collaborate, when agents from different vendors or teams need to interoperate, or when the product is building an ecosystem that third-party agents will participate in. The right time to think about A2A architecture is before building the agent coordination layer, not after discovering that agents can't talk to each other.


Agent memory: what your agent can and cannot remember

An agent with no memory is stateless. Every conversation starts fresh. The agent has no knowledge of what it did last session, what the user has told it before, or how the task has evolved over time. For simple single-session tasks, this is fine. For anything involving continuity — a research project that spans days, a customer relationship that spans months — it's a fundamental limitation.

Memory architecture for agents breaks into four types, each appropriate for different product contexts.

In-context memory is whatever fits in the current context window. Conversation history, tool outputs from earlier in the session, documents loaded for this task. It disappears when the session ends. This is sufficient for simple, self-contained tasks and has zero additional infrastructure cost.

External storage is a database the agent can read and write explicitly. The agent stores facts, decisions, and task state that it wants to remember across sessions. It retrieves them when relevant to the current task. This enables continuity across sessions but requires the agent to decide what's worth storing and to retrieve the right things when needed. Vector databases (covered in Post 7) are a common implementation when retrieval is by semantic similarity.

Semantic memory is a layer of learned facts about the user, their preferences, and their context — built up over time from interactions. "This user prefers concise answers." "This user works in healthcare compliance." This is the memory layer that enables personalization without requiring users to re-explain their context in every session.

Episodic memory records what the agent did in previous tasks and what the outcomes were. It enables the agent to improve from experience — to avoid approaches that failed before and to repeat approaches that succeeded. This is the most valuable and the most complex memory type for most product contexts.

The PM question for each memory type: does the value of maintaining this memory justify the cost and complexity of the implementation? In-context memory is free. External storage requires infrastructure and governance. Semantic memory raises privacy questions about what the agent knows about users. Episodic memory requires a system for evaluating whether past outcomes should inform future behavior.
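The four memory types can be pictured as four fields with different lifetimes. The sketch below is a data-structure illustration only: plain Python containers stand in for the real stores (a database or vector index for external storage, for example), and `AgentMemory`, `Episode`, and their method names are hypothetical:

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    task: str
    approach: str
    succeeded: bool


@dataclass
class AgentMemory:
    in_context: list = field(default_factory=list)  # this session only
    external: dict = field(default_factory=dict)    # explicit reads and writes across sessions
    semantic: dict = field(default_factory=dict)    # learned facts about the user
    episodic: list = field(default_factory=list)    # past tasks and their outcomes

    def end_session(self) -> None:
        """In-context memory disappears when the session ends; the rest persists."""
        self.in_context.clear()

    def failed_approaches(self, task: str) -> set:
        """Approaches that failed before on this task, so they can be avoided."""
        return {e.approach for e in self.episodic if e.task == task and not e.succeeded}


memory = AgentMemory()
memory.in_context.append("user: summarize last week's competitor launches")
memory.external["project:status"] = "draft report sent for review"
memory.semantic["answer_style"] = "concise"
memory.episodic.append(Episode("competitor research", "scrape pricing pages", succeeded=False))
memory.end_session()
print(memory.in_context, memory.failed_approaches("competitor research"))
```

Only `in_context` is free in the sense described above; each of the other three fields implies storage, governance, or evaluation work behind it.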


The agent frameworks landscape

The choice of agent framework is a development decision, but it has product implications that PMs should understand.

The landscape splits into two categories: provider-native SDKs optimized for one model family, and independent frameworks that work across providers.

Provider-native frameworks: Claude Agent SDK (deepest MCP integration, strongest OS access), OpenAI Agents SDK (clean handoff model for triage and specialist flows, built-in guardrails), Google ADK (native A2A support, managed deployment on Vertex AI).

Independent frameworks: LangGraph (persistent checkpointing, crash recovery, the strongest answer to "what happens when step 7 fails"), CrewAI (fastest path from idea to prototype, natural language role definition, good for validating agent concepts before committing to architecture).

The PM decision framework: if model flexibility matters — if you might switch models or use different models for different tasks — choose an independent framework. If you're committed to one model family and want the deepest integration and the most production-hardened implementation, choose the provider-native SDK.

Do not start with multi-agent systems. Start with simple workflows and graduate. Do not expect production quality from a demo. Budget four to six months for complex agent systems. The frameworks that produce impressive demos quickly are not always the frameworks that produce reliable production systems. The faster a framework gets you to a prototype, the more likely it is hiding complexity that surfaces later.


Guardrails that actually work

Post 14 covered red teaming for AI products. Agents introduce specific guardrail requirements that go beyond what standard safety filters address.

Builders who embed security logic inside prompts have already lost. Guardrails must live outside the LLM. Kill switches for tool calls cannot depend on model behavior alone.

The guardrail architecture that works in production operates at four levels.

Input guardrails validate what the agent receives before it starts reasoning. Malformed inputs, inputs that match known attack patterns, inputs that are outside the agent's defined scope — block these before any model call. This is cheap (no inference required) and catches the most common abuse patterns.

Tool call validation checks every tool invocation before execution. Does the agent have permission to call this tool? Are the parameters within expected ranges? Does this call match the pattern of legitimate use for this agent? This is the layer that prevents the agent from using legitimate tools in unintended ways based on adversarial inputs.

Action approval gates require human confirmation before irreversible or high-stakes actions. The definition of "irreversible" and "high-stakes" should be explicit in the product spec, not left to the agent to determine. Sending an email: approval gate. Querying a database: no gate needed. Deleting a record: approval gate. Making a purchase: approval gate. The list should be enumerated by the PM, not inferred by the model.

Output monitoring reviews what the agent accomplished against what it was asked to do. It does not need to cover every session: review a sample, plus every session that triggered an error or an approval gate. This is the learning layer that catches systematic failures that single-session monitoring misses.

The meta-principle: the more powerful the agent, the more important it is that the guardrails are structural rather than instructional. A guardrail that says "don't delete files unless explicitly asked" in the system prompt is a suggestion. A guardrail that requires human approval for any delete operation is an architectural constraint. Only the latter is reliable under adversarial conditions.
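A structural approval gate can be as small as a lookup that runs outside the model. In this sketch the action names mirror the examples in this section, while `execute`, `ApprovalRequired`, and the `approved_by` parameter are hypothetical; unknown actions are rejected so that anything missing from the spec fails closed:

```python
from typing import Any, Callable, Optional

# Enumerated in the product spec by the PM, not inferred by the model.
REQUIRES_APPROVAL = {"send_email", "delete_record", "make_purchase"}
NO_GATE = {"query_database"}


class ApprovalRequired(Exception):
    """Raised when a gated action is attempted without human approval."""


def execute(action: str, run: Callable[[], Any], approved_by: Optional[str] = None) -> Any:
    if action in REQUIRES_APPROVAL:
        if approved_by is None:
            raise ApprovalRequired(f"'{action}' needs human approval before it runs")
    elif action not in NO_GATE:
        # Fail closed: an action nobody classified is never executed.
        raise ValueError(f"'{action}' is not enumerated in the product spec")
    return run()


print(execute("query_database", lambda: ["row 1", "row 2"]))
try:
    execute("delete_record", lambda: "deleted")
except ApprovalRequired as exc:
    print(exc)
print(execute("delete_record", lambda: "deleted", approved_by="reviewer@example.com"))
```

Because the check lives in code that the model cannot rewrite, no prompt injection can talk the system past it; at worst the agent requests an approval that a human declines.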


Agentic product metrics

Standard assistant metrics — acceptance rate, regeneration rate, conversation depth — don't fully capture agentic product quality. Agents need additional metrics that reflect the multi-step, action-taking nature of what they do.

Task completion rate is the foundational metric. What percentage of delegated tasks does the agent complete without human intervention? This is the agent equivalent of the assistant's acceptance rate — the measure of how reliably the product delivers on its core promise. Anything below 80% on well-defined tasks signals that the agent scope is too broad, the reliability of individual steps is too low, or both.

Step success rate disaggregates task completion by step type. An agent that fails consistently at step 3 across many different tasks has a specific step 3 problem. An agent with uniform failure distribution across steps has a more general reliability problem. The diagnostic value of step-level metrics is what makes the multi-stage pipeline observability framework from Post 11 so directly applicable to agentic products.

Human intervention rate measures how often users override the agent mid-task, pause it to correct something, or invoke an approval gate. High intervention rate on routine tasks signals that the agent's judgment on those tasks doesn't match user expectations. It's also an opportunity: high-intervention steps are candidates for explicit human approval gates, which converts a frustrating surprise into an expected checkpoint.

Time to completion versus manual baseline is the metric that makes agentic value visible in business terms. If the agent completes a research task in 12 minutes that takes a human 3 hours, the business case is clear. If it takes 45 minutes and requires 20 minutes of human correction, the business case is much weaker. This metric requires measuring the manual baseline first, which many teams skip.
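The four metrics can be computed from per-step logs. The sketch below assumes a hypothetical log schema (`StepResult`, `TaskRun`) in which each step records whether it succeeded and whether a human intervened; real telemetry will look different, and the functions expect a non-empty list of runs:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class StepResult:
    step: str
    succeeded: bool
    human_intervened: bool = False


@dataclass
class TaskRun:
    steps: list
    agent_minutes: float
    correction_minutes: float = 0.0

    @property
    def intervened(self) -> bool:
        return any(s.human_intervened for s in self.steps)

    @property
    def completed_unaided(self) -> bool:
        return all(s.succeeded for s in self.steps) and not self.intervened


def task_completion_rate(runs: list) -> float:
    """Share of runs completed without human intervention."""
    return sum(r.completed_unaided for r in runs) / len(runs)


def human_intervention_rate(runs: list) -> float:
    """Share of runs in which a human stepped in at least once."""
    return sum(r.intervened for r in runs) / len(runs)


def step_success_rate(runs: list) -> dict:
    """Success rate per step name, to locate where failures concentrate."""
    totals = defaultdict(lambda: [0, 0])  # step -> [successes, attempts]
    for run in runs:
        for s in run.steps:
            totals[s.step][0] += int(s.succeeded)
            totals[s.step][1] += 1
    return {step: ok / n for step, (ok, n) in totals.items()}


def minutes_saved(run: TaskRun, manual_baseline_minutes: float) -> float:
    """Manual baseline minus agent time and human correction time."""
    return manual_baseline_minutes - (run.agent_minutes + run.correction_minutes)


runs = [
    TaskRun([StepResult("search", True), StepResult("synthesize", True)], agent_minutes=12),
    TaskRun([StepResult("search", True), StepResult("synthesize", False, True)],
            agent_minutes=45, correction_minutes=20),
]
print(task_completion_rate(runs), human_intervention_rate(runs))
print(step_success_rate(runs))
print([minutes_saved(r, manual_baseline_minutes=180) for r in runs])
```

Note that `minutes_saved` is only meaningful once the manual baseline has actually been measured, which is the step the section warns teams tend to skip.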


What comes next

You now have the full technical and product picture of agentic AI — the protocols that connect agents to tools and to each other, the memory architectures that give agents continuity, the guardrails that make autonomy safe to ship, and the metrics that tell you whether it's working.

The remaining posts shift to the strategic questions this series has been building toward. Post 20 covers a scenario most existing product teams face: not building a new AI product from scratch, but integrating AI into an existing one. The decisions are different, the risks are different, and the failure modes are different from what most AI PM content addresses. Post 21 covers defensibility — what makes an AI product actually hard to compete with over time. Post 22 closes the series with the interview preparation synthesis: how to demonstrate everything this series has covered under the specific pressures of a senior PM interview.