AI UX: The Design Patterns Behind Products Users Trust

Senior Product Manager writing about two sides of AI: building AI products that work at scale, and using AI to work more effectively as a PM. I share frameworks for Applied AI product management—economics, evaluation, agent design, responsible deployment—alongside practical guides for AI-powered productivity, workflows, and decision-making. If you're building AI products or figuring out how to leverage AI in your PM workflow (or both), this is for you. Currently based in Seattle.

Part 17 of the Applied AI Product Management series. The previous posts covered how to build, measure, deploy, and economically model AI features. This post covers how users actually experience them — the interface patterns that determine whether users trust what the AI produces, and the metrics that tell you whether that trust is developing or eroding.


Traditional UX design rests on one assumption so fundamental it's rarely stated: the same input produces the same output. Press a button, get a result. Submit a form, get a confirmation. The system is deterministic. Design for one state and you've designed for all of them.

AI breaks this assumption entirely. The same input produces different outputs depending on temperature settings, context window contents, model version, and a degree of inherent randomness. The interface must communicate something traditional UX never had to: that the product is operating probabilistically, not deterministically, and that this is expected behavior rather than a bug.

Every design pattern in this post is a response to that fundamental difference. The teams that treat AI interfaces like traditional software interfaces produce products that feel broken when the model varies, untrustworthy when it hedges, and surprising when it fails. The teams that design specifically for probabilistic systems produce products that feel honest, controllable, and worth coming back to.


Streaming: designing for the model that thinks out loud

Post 4 covered streaming as a technical implementation. Post 6 covered it as an infrastructure decision. Neither covered the design questions that make or break the streaming experience.

The core insight is that streaming shifts the user's relationship with latency from waiting to reading. A response that arrives all at once after eight seconds feels slow. The same response appearing word by word starting 400 milliseconds after the user submits feels fast — not because it is faster, but because the user is engaged from the first word rather than staring at a blank screen.

This is why ChatGPT's streaming interface was one of its most consequential early product decisions. The underlying models were slow by the standards of every other software users interacted with. Streaming made that slowness tolerable. Users were processing the first sentence while the third was generating. By the time the full response appeared, they'd already formed an opinion about whether it was useful.

The design decisions that make streaming work:

Token-by-token versus sentence-by-sentence rendering affects the experience significantly. Token-by-token — the pattern most chat interfaces use — produces the fastest perceived responsiveness and the characteristic AI typing effect. Sentence-by-sentence rendering is smoother but adds a brief delay before each unit of content appears. For conversational interfaces, token-by-token is almost always right. For document generation or longer-form output, sentence-by-sentence avoids the visual jitter of individual words appearing rapidly across multiple lines.

Interruption design matters more than most teams realize. Users should be able to stop a generation in progress without it feeling like an error state. A visible stop button that appears during generation, combined with graceful handling of partial responses, tells the user they're in control of the interaction rather than waiting for the model to finish. Products that don't offer interruption force users to wait for a response they've already decided they don't want, which trains them to write shorter prompts rather than to explore the model's capabilities.

Partial response quality shapes the impression of the full response. If the first sentence of a streamed response is weak, users form a negative judgment before the model reaches its strongest reasoning. Prompting strategies that front-load the most relevant content — answer first, reasoning second — produce a better streaming experience than those that build toward a conclusion.
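As a minimal sketch of the sentence-by-sentence option, the generator below buffers streamed tokens and releases text only when a sentence is complete. The function name `by_sentence` and the sentence-boundary rule (a period, exclamation mark, or question mark followed by whitespace) are illustrative assumptions, not something a particular chat product is known to use; real text with abbreviations or decimals would need a better splitter.

```python
import re
from typing import Iterable, Iterator

# Assumed, deliberately naive boundary rule: ., !, or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def by_sentence(tokens: Iterable[str]) -> Iterator[str]:
    """Buffer streamed tokens and yield one complete sentence at a time."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence; the last part
        # is still being generated and stays in the buffer.
        for sentence in parts[:-1]:
            yield sentence
        buffer = parts[-1]
    # Flush the remainder when the stream ends, including when the user
    # stops generation early, so a partial response is still shown.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    stream = ["Yes", ".", " It", " works", ".", " Mostly"]
    for sentence in by_sentence(stream):
        print(sentence)
```

Because the final flush runs whenever the token source stops, the same function covers the interruption case: cutting the stream short still leaves the user with the partial text rather than an error state.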


Loading states: the design of productive waiting

What you show between the user's action and the model's first output determines whether the product feels broken or responsive. Most AI products underinvest in this design space and pay for it in user trust.

The most common failure: a spinner that offers no information about what's happening. Spinners are appropriate for sub-second waits. For the one to five second waits common in AI generation, they communicate only "something is happening" — which is the minimum possible useful information. Users facing a spinner for three seconds don't know if the model is still working, if there was an error, or if the product has hung.

The patterns that work at different time scales:

Zero to one second: a subtle pulse or skeleton screen. Users expect sub-second response from software. If the AI responds in under a second, match the visual pattern of traditional software: a brief indicator that something is happening, no more.

One to three seconds: status text that describes the current operation. "Analyzing your document." "Searching for relevant sources." "Generating response." This isn't just reassurance — it's a progress report that tells users what's happening and, implicitly, that the system is working as intended. Perplexity's pattern of showing sources appearing before the answer generates is a strong execution of this: users see evidence of retrieval happening before generation begins, which builds confidence in the grounding of the response.

Three to ten seconds: intermediate results where possible. Show what's available while the rest generates. For a research assistant, show retrieved sources while synthesis is happening. For a document analyzer, show the document being highlighted while analysis runs. Progressive disclosure converts waiting into reading, which is always preferable.

Ten seconds or longer: time estimates and cancel options become necessary. A user who doesn't know whether to wait another ten seconds or another two minutes makes different decisions. Showing approximate remaining time, allowing background processing with a notification on completion, or surfacing partial results early — any of these is better than an indefinite wait.
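The time scales above can be read as a simple lookup from expected wait to loading pattern. The sketch below is one way to encode that, assuming the product can estimate the wait up front; the names `LoadingPattern` and `pick_loading_pattern` are hypothetical, and how the exact boundaries at one, three, and ten seconds are assigned is an assumption, since the ranges in the text share their endpoints.

```python
from enum import Enum


class LoadingPattern(Enum):
    PULSE = "subtle pulse or skeleton screen"
    STATUS_TEXT = "status text describing the current operation"
    INTERMEDIATE_RESULTS = "intermediate results while the rest generates"
    ESTIMATE_AND_CANCEL = "time estimate, cancel option, background processing"


def pick_loading_pattern(expected_wait_s: float) -> LoadingPattern:
    """Map an estimated wait (in seconds) to the loading pattern to show."""
    if expected_wait_s < 0:
        raise ValueError("expected_wait_s must be non-negative")
    if expected_wait_s < 1:
        return LoadingPattern.PULSE
    if expected_wait_s < 3:
        return LoadingPattern.STATUS_TEXT
    if expected_wait_s < 10:
        return LoadingPattern.INTERMEDIATE_RESULTS
    return LoadingPattern.ESTIMATE_AND_CANCEL


if __name__ == "__main__":
    for wait in (0.4, 2.0, 6.0, 45.0):
        print(wait, "->", pick_loading_pattern(wait).value)
```

In practice the estimate is often wrong, so an interface would likely escalate through these patterns as elapsed time grows rather than committing to one at the start.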

The error state that most teams forget to design: the model returns a result, but it's clearly not what the user wanted. This is not a technical error. The model worked correctly. The output was simply wrong for the user's intent. A "try again" or "refine this" affordance adjacent to the output acknowledges that AI outputs are starting points rather than final answers. Products that don't offer this force users to manually reconstruct their prompt and submit again, a friction that teaches them the product is unreliable rather than something they can iterate on.


Error handling and graceful degradation

AI errors are categorically different from traditional software errors, and treating them the same way produces responses that feel tone-deaf.

A traditional software error is the system failing to do something it should be able to do. "Payment failed." "File not found." "Connection timeout." The appropriate response is a clear message explaining what failed and how to fix it.

An AI error is often the model declining to do something, being uncertain about something, or producing output that doesn't match what the user wanted. None of these map to traditional error patterns. "I couldn't help with that request" is not a system failure. It's a capability boundary. "I'm not certain about this" is not an error. It's appropriate calibration. "This isn't quite what you wanted" is not a failure state. It's the starting point for iteration.

The design distinction that matters: confidence indicators should be part of the normal output design, not part of error handling. A model that shows confidence levels alongside its outputs — through visual differentiation, hedging language, or explicit uncertainty disclosure — normalizes uncertainty as part of the product rather than treating it as a failure. Bing Chat's citation pattern, showing which claims are grounded in sources and which are the model's inference, is one of the cleaner executions of this. Users can see immediately what to trust and what to verify.

Refusal design deserves particular attention because it's where many AI products alienate users who have legitimate needs. A refusal that says only "I can't help with that" without explanation forces users to guess whether their request crossed a safety threshold, whether it's outside the model's capability, or whether there's a phrasing that would work. Refusals that explain the boundary — "I can't provide specific legal advice, but I can explain the general legal principles relevant to your situation" — preserve the user relationship and redirect toward what the product can do. The difference is the difference between a product that feels paternalistic and one that feels honestly limited.

Graceful degradation for agentic features requires an additional layer of thought. When a multi-step agent fails partway through a task, the failure design must answer: what did the agent accomplish before failing, what state has the world been left in, and what does the user need to do to either resume or recover? An agent that deletes three files, fails on the fourth, and returns "an error occurred" has left the user worse off than before they started. Showing what was completed, preserving the ability to undo completed steps, and providing a clear path to resume from the failure point converts a trust-destroying experience into a recoverable one.
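The three questions above (what was accomplished, what state the world is in, how to resume or recover) can be sketched as a step runner that records progress and pairs every action with an undo. Everything here is a hypothetical illustration: `Step`, `RunReport`, `run_steps`, and `undo_completed` are invented names, and the sketch assumes each step can supply a working undo, which real side effects often cannot.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    name: str
    do: Callable[[], None]
    undo: Callable[[], None]  # assumed to reverse `do` exactly


@dataclass
class RunReport:
    completed: List[str] = field(default_factory=list)
    failed_step: Optional[str] = None
    error: Optional[str] = None
    remaining: List[str] = field(default_factory=list)


def run_steps(steps: List[Step]) -> RunReport:
    """Run steps in order; on failure, stop and report exactly where things stand."""
    report = RunReport()
    for index, step in enumerate(steps):
        try:
            step.do()
        except Exception as exc:
            report.failed_step = step.name
            report.error = str(exc)
            # Steps after the failed one were never attempted.
            report.remaining = [s.name for s in steps[index + 1:]]
            return report
        report.completed.append(step.name)
    return report


def undo_completed(steps: List[Step], report: RunReport) -> None:
    """Revert completed steps in reverse order of execution."""
    by_name = {s.name: s for s in steps}
    for name in reversed(report.completed):
        by_name[name].undo()
```

The report gives the interface what it needs to say "two steps completed, the third failed, one was not attempted" instead of "an error occurred", and the same data supports either resuming from `failed_step` or reverting.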


Human-in-the-loop interface patterns

Post 8 covered automation vs control as a product tradeoff. The interface design question is how to make the right level of control feel effortless rather than burdensome.

The automation continuum from Post 8, running from "AI assists" through "AI acts with easy undo," maps to specific interface patterns at each level.

At the suggestion level, where AI recommends and users decide, the interface discipline is making accept and reject as fast as possible. Grammarly's single-click accept on an underlined suggestion is the benchmark. GitHub Copilot's Tab to accept takes even less effort. The suggestion should require less effort to accept than to dismiss. Products that require two clicks to accept a suggestion are optimizing against adoption.

At the draft level, where AI generates and users edit, the interface must communicate clearly what's AI-generated and what's user-written. Notion AI's visual distinction between AI-generated text and existing document content addresses the attribution problem — users should never be uncertain about what they wrote versus what the model wrote. This matters for trust and for the cognitive load of editing. Editing an AI draft without visual differentiation forces users to mentally track attribution while also evaluating content quality.

At the action level, where AI takes actions with user oversight, the design requirement is reversibility. Every action the AI takes should have a visible undo path that's faster than the action itself. Gmail's undo send — available for several seconds after sending — is the benchmark for how reversibility should feel. An AI that sends emails should have an undo that matches that benchmark. An AI that modifies files should show what changed and offer one-click revert. The technical capability to undo is necessary but not sufficient. The interface must make undoing feel like the natural response to a mistake, not a recovery procedure.

The approval gate pattern for high-stakes actions deserves particular attention. For any AI action where the cost of an error is high and the action is not easily reversible — sending a message to a large list, modifying critical data, executing a financial transaction — requiring explicit user approval before execution is not overhead. It's the design pattern that makes automation trustworthy. The approval screen should show exactly what the AI is about to do, in plain language, with a clear mechanism to review or modify before confirming.
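A minimal sketch of the gate, assuming each proposed action has already been labeled as reversible or not and as high-cost or not (how that labeling happens is outside this example). The names `ProposedAction`, `requires_approval`, and `execute` are hypothetical, and `approve` stands in for whatever approval screen the product shows.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ProposedAction:
    summary: str      # plain-language description shown on the approval screen
    reversible: bool  # can the action be undone after it runs?
    high_cost: bool   # would an error here be expensive?


def requires_approval(action: ProposedAction) -> bool:
    """Gate only actions that are both costly to get wrong and hard to undo."""
    return action.high_cost and not action.reversible


def execute(
    action: ProposedAction,
    run: Callable[[], None],
    approve: Callable[[str], bool],
) -> bool:
    """Run the action, asking the user first when the gate applies.

    Returns True if the action ran, False if the user declined.
    """
    if requires_approval(action) and not approve(action.summary):
        return False
    run()
    return True
```

Keeping the rule narrow matters for the design: gating every action turns approval into a reflex click, while gating only the costly and irreversible ones keeps the prompt meaningful.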


Copilot vs assistant vs agent: three different UX contracts

These three product patterns look superficially similar — they all involve AI responding to user input — but they establish fundamentally different contracts with users, and designing them the same way produces bad experiences.

A copilot stays in the user's workflow. The user is always in control. The AI appears alongside what the user is doing and offers suggestions. Grammarly underlining while you write. GitHub Copilot suggesting while you code. Smart Compose appearing while you type. The UX contract: I will offer, you will decide. Every suggestion is visually distinct from the user's work. Every suggestion requires explicit acceptance. Nothing changes without user action.

The design principle for copilots: invisibility until relevant. A copilot that constantly surfaces suggestions the user isn't interested in quickly becomes noise. The goal is to be present enough to be useful and absent enough not to be distracting. This requires tuning the threshold of when to surface suggestions — lower threshold for users who accept frequently, higher for users who reject frequently. The acceptance rate data from Post 2 is what drives this calibration.
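One way to picture that calibration is a per-user confidence threshold that moves with the user's acceptance rate. The sketch below is an assumption-heavy illustration: the function name, the linear adjustment, and the default values (`base`, `max_shift`, `min_shown`) are all invented for the example and would need tuning against real data.

```python
def suggestion_threshold(
    accepted: int,
    shown: int,
    base: float = 0.6,
    max_shift: float = 0.2,
    min_shown: int = 20,
) -> float:
    """Return the minimum model confidence required to surface a suggestion.

    Users who accept often get a lower threshold (more suggestions);
    users who reject often get a higher one (fewer suggestions).
    """
    if accepted < 0 or shown < 0 or accepted > shown:
        raise ValueError("need 0 <= accepted <= shown")
    if shown < min_shown:
        return base  # too little data to personalize yet
    acceptance_rate = accepted / shown
    # Linear shift: a 50% acceptance rate leaves the threshold at `base`,
    # 100% lowers it by `max_shift`, and 0% raises it by `max_shift`.
    shift = (0.5 - acceptance_rate) * 2 * max_shift
    return min(1.0, max(0.0, base + shift))


if __name__ == "__main__":
    print(suggestion_threshold(accepted=90, shown=100))
    print(suggestion_threshold(accepted=10, shown=100))
```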

An assistant responds when asked. The user initiates every interaction. ChatGPT, Claude, Perplexity — the user has a question or task, they submit it, the assistant responds. The UX contract: you ask, I answer. The design challenge is managing the gap between what users expect from a question and what AI can reliably deliver. The interface patterns that matter most: making uncertainty visible in responses, making it easy to ask follow-up questions without starting over, and preserving conversation context across a session so users don't have to repeat themselves.

The conversation depth metric is specific to assistant products. It measures how many turns a user takes in a single session before leaving. Low conversation depth means users are getting their answer in one or two turns and leaving — which sounds good but often means they're not discovering that the assistant can do more. High conversation depth means users are engaging iteratively, which is the pattern that builds the kind of product dependency that drives retention.

An agent acts on behalf of the user. The user delegates a goal rather than asking a question. The agent plans, uses tools, makes decisions, and reports back. The UX contract: you tell me what you want, I'll figure out how to get it done. This is the hardest contract to design for because users are no longer in control of individual steps — they've handed that control to the agent.

The design requirements for agent interfaces: progress visibility (what is the agent doing right now), decision transparency (when the agent is making a judgment call, show what the options were and why this one was chosen), approval gates for consequential actions (covered above), and clear task completion signaling (how does the user know when the agent is done and what the outcome was). An agent interface that doesn't surface what it's doing produces the anxious feeling of having handed your keys to someone and then lost sight of them. Trust requires visibility into the process, not just the outcome.


Multimodal UX: what changes when AI can see

Adding vision, audio, or document understanding to a product changes more than the input methods. It changes the trust architecture, the error handling patterns, and the feedback mechanisms.

The most significant change is in expectation calibration. Users who upload an image and ask a question have a visual reference that they expect the model to share. When the model's response references something the user doesn't see in the image, or misses something obvious, the error is immediately visible in a way that a factual error in text might not be. Multimodal products need to be more explicit about what they can and cannot detect reliably, and the uncertainty communication patterns from earlier in this post become more important, not less.

Document understanding introduces the specific challenge of source transparency. When a user uploads a 50-page PDF and asks a question, the model's answer is grounded in specific passages. Showing which passages informed the response (through highlighting, citation, or page reference) converts an opaque process into a traceable one. Users who can see what the model read are better able to verify what it concluded. Products that show only the conclusion without the source mapping are asking for a higher level of trust than document processing tasks typically warrant.

Audio interfaces, whether voice input or voice output, introduce temporal constraints that visual interfaces don't have. Users can't scroll back through a voice response to re-read something they missed. Products that deliver information primarily through audio need to be more structured in how information is sequenced — most important first, supporting details second — because the user's ability to extract and revisit information is more limited than in text.

The camera as a persistent input changes the interface contract fundamentally. Products that access the camera continuously rather than for discrete snapshots are operating with a different level of ambient awareness. The design responsibility is communicating when the camera is active, what the model is analyzing, and what it might surface — before the user has any reason to worry rather than after they've discovered something unexpected. Ambient AI interfaces that work silently in the background are the most powerful UX pattern in this category and the one that requires the most careful transparency design.


Generative product metrics: what trust actually looks like in the data

Post 2 established the four-layer metrics framework. For generative AI products specifically, Layer 2 (feature engagement) has a distinct set of metrics that most PM dashboards don't track by default.

Edit rate measures what percentage of AI-generated content users modify before using it. Low edit rate means users trust the output enough to use it as-is. High edit rate means the AI is generating drafts that require significant rework — which raises the question of whether the feature is saving time or creating a new editing task. Notion AI tracks this specifically: a high edit rate on document generation signals that the model is producing output that users find useful as a starting point but not as a finished product, which informs where the model needs improvement.

Regeneration rate measures how often users ask the model to try again rather than accepting or editing the output. High regeneration rate is the clearest signal that the model is consistently missing what users want. It's more diagnostic than a simple thumbs-down metric because it tells you the user wanted this task done by AI, just not this way. Tracking regeneration rate by task type reveals which capabilities are reliably useful and which are consistently unsatisfying.

Prompt reformulation rate measures how often users rephrase their request and submit again without accepting any output. Distinct from regeneration — reformulation means the user concluded that the model's failure was due to how they asked, not how the model responded. High reformulation rate signals that the interface isn't helping users understand how to get what they want. This is an interface design problem as much as a model problem. Better prompting guidance, examples of what works, and prompt suggestions reduce reformulation rate without improving the model at all.
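The three rates above can be computed from a log with one outcome per generation. This is a sketch under stated assumptions: the outcome labels (`accepted`, `edited`, `regenerated`, `reformulated`) and the choice of denominators are illustrative, not a standard definition. Here edit rate is taken over outputs the user actually used, while regeneration and reformulation rates are taken over all generations.

```python
from collections import Counter
from typing import Dict, Iterable


def generation_rates(outcomes: Iterable[str]) -> Dict[str, float]:
    """Compute edit, regeneration, and reformulation rates from outcome labels.

    Each item is the final outcome of one generation:
    "accepted" (used as-is), "edited" (modified, then used),
    "regenerated" (user asked the model to try again), or
    "reformulated" (user rephrased the prompt without using the output).
    """
    counts = Counter(outcomes)
    total = sum(counts.values())
    used = counts["accepted"] + counts["edited"]
    return {
        # Share of used outputs that were modified before use.
        "edit_rate": counts["edited"] / used if used else 0.0,
        # Share of all generations the user asked to redo.
        "regeneration_rate": counts["regenerated"] / total if total else 0.0,
        # Share of all generations abandoned in favor of a rephrased prompt.
        "reformulation_rate": counts["reformulated"] / total if total else 0.0,
    }


if __name__ == "__main__":
    log = ["accepted"] * 6 + ["edited"] * 2 + ["regenerated"] + ["reformulated"]
    print(generation_rates(log))
```

Running the same function on logs filtered by task type gives the per-task breakdown described above.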

Conversation depth in assistant products and task completion rate in agentic products are the Layer 3 metrics that connect engagement to outcomes. A writing assistant where users consistently reach a finished document has a different product-market fit signal than one where users generate a draft, look at it, and close the tab. An agentic research tool where tasks consistently complete without user intervention has a different reliability profile than one where agents regularly stall awaiting clarification.

The metric that connects all of these to business outcomes is sustained usage, specifically the Week 1 versus Week 4 retention pattern covered in Post 2. Generative AI features are particularly susceptible to the novelty effect: generating content is inherently impressive the first time and far less so the fifth time, when the output quality still doesn't match the use case. Week 4 retention is the clearest signal that the product is delivering ongoing value rather than one-time novelty.
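A small sketch of the Week 1 versus Week 4 comparison, assuming usage is recorded as day offsets from each user's first use of the feature. The week definition is an assumption made for this example (week n covers days 7n through 7n+6, so week 0 is the first-use week); Post 2 may define the windows differently.

```python
from typing import Dict, Set


def weekly_retention(active_days: Dict[str, Set[int]], week: int) -> float:
    """Fraction of users active at least once during the given week.

    `active_days` maps a user id to the set of day offsets (0 = first use)
    on which that user used the feature.
    """
    if not active_days:
        return 0.0
    start, end = 7 * week, 7 * week + 6
    retained = sum(
        1
        for days in active_days.values()
        if any(start <= day <= end for day in days)
    )
    return retained / len(active_days)


if __name__ == "__main__":
    usage = {
        "user_a": {0, 8, 30},
        "user_b": {0, 9},
        "user_c": {0},
        "user_d": {0, 7, 28},
    }
    week1 = weekly_retention(usage, week=1)
    week4 = weekly_retention(usage, week=4)
    print(f"Week 1: {week1:.0%}, Week 4: {week4:.0%}")
```

A large drop between the two numbers is the novelty pattern described above; a small drop suggests the feature has become part of the user's routine.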


What comes next

You now have the design layer: the interface patterns that make AI feel trustworthy and the metrics that tell you whether it is. The remaining posts cover how to take these products to market, how to think about them strategically when they're being added to existing products rather than built from scratch, what makes them defensible over time, and how to demonstrate the full range of this thinking under interview pressure.

Post 18 covers the go-to-market decisions specific to AI products: how to position when users don't fully understand what the product does, how to price when cost structure is variable and uncertain, and how to build the kind of trust at scale that turns curious first users into committed ones.
