Edge, Cloud, and the AI Stack Every PM Should Understand

12 min read
Senior Product Manager writing about two sides of AI: building AI products that work at scale, and using AI to work more effectively as a PM. I share frameworks for Applied AI product management—economics, evaluation, agent design, responsible deployment—alongside practical guides for AI-powered productivity, workflows, and decision-making. If you're building AI products or figuring out how to leverage AI in your PM workflow (or both), this is for you. Currently based in Seattle.

Part 7 of the Applied AI Product Management series. In Part 6, we built the framework for choosing your AI approach, your model, and how to serve it. Now we go one layer deeper: the infrastructure decisions that are being made on your behalf every day, and what they mean for your product twelve months from now.


A consumer app PM discovered six months after launch that her team had chosen a vector database that billed on read units rather than storage. When the product took off and queries scaled, the costs scaled non-linearly in a way nobody had modeled. The database bill went from $800 a month to $11,000 in eight weeks. The migration to a different database took two engineers three weeks.

Nobody made a bad decision. The engineers picked a reasonable tool at reasonable early-stage scale. But nobody had asked the question that would have surfaced the risk: how does this cost behave when we succeed?

That's the PM's role in infrastructure decisions: not to make them, but to ask the questions that prevent the ones nobody will be able to undo later.


The stack in five layers

Before getting into specific decisions, it helps to have a mental model of where each piece lives. There are five layers in a typical AI product stack, and each one generates a set of decisions with different cost, flexibility, and lock-in profiles.

The foundation model layer is what you're calling: GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro, or a self-hosted open-source model. Part 6 covered this in depth. It's the most visible layer but not always the most consequential for long-term cost.

The orchestration layer is the framework that connects your model to data sources, tools, and multi-step logic: LangChain, LlamaIndex, or custom code. It determines how your prompts get assembled, how retrieval works, and how complex workflows get managed.

The vector database layer stores the embeddings that power semantic search and RAG retrieval: Pinecone, Weaviate, Qdrant, pgvector. It determines retrieval quality, latency, and cost at scale.

The application layer is your product: the UI, the business logic, the user-facing experience built on top of everything below it.

The observability layer is monitoring, tracing, and evaluation infrastructure: LangSmith, Weights & Biases, or custom dashboards. It determines whether you can see what's happening when something goes wrong.

PMs tend to focus on the application layer and the foundation model layer because those are the most visible. The other three layers (orchestration, vector storage, and observability) are where the decisions that lock you in actually live.


Vector databases: the decision that surprises teams at scale

A vector database stores high-dimensional numerical representations of your content (documents, product descriptions, support articles, code) so that semantic similarity search can find the most relevant chunks to include in a prompt. It's the retrieval engine that makes RAG work.

The landscape has matured significantly. In 2026, all four major contenders have production-grade offerings, but they have diverged in architecture, pricing, and sweet spots. The choice is no longer as simple as "managed vs. self-hosted."

Pinecone wins on fastest time-to-production. It's fully managed and usage-billed, it gets you to production with no infrastructure work, and it delivers predictable performance without index tuning. The tradeoff: costs scale non-linearly with vector count and query volume above ten million vectors, and your data lives in Pinecone's infrastructure with no self-hosted option. The PM risk is exactly the one from the opening story: a product that succeeds unexpectedly can face a bill that triples before anyone notices.

Weaviate wins hybrid search and multi-tenant deployments. It supports keyword and vector search in a single query, essential for products where users need to find things by both meaning and exact match. It's open-source and free to self-host, with a managed cloud option. The tradeoff is operational complexity; self-hosting Weaviate at production scale requires Kubernetes experience your team may not have.

Qdrant wins raw query speed and filtered search. Written in Rust, it benchmarks at the highest queries-per-second of the major options for filtered workloads: cases where users need semantically similar items that also match specific metadata criteria (for example, documents similar to this query where department = legal). It's self-hostable or available as managed cloud. The tradeoff is a smaller ecosystem than Pinecone or Weaviate.

pgvector wins when you already run Postgres and have under ten million vectors. It adds vector search to your existing database with no new infrastructure, no sync layer, and no separate service to manage. pgvector is no longer "the slow option"; pgvectorscale delivers competitive query performance, making it a legitimate production choice for most teams under that scale threshold. The tradeoff is a hard ceiling: above fifty million vectors, purpose-built engines pull significantly ahead.

The decision rule for PMs: start by asking how many vectors you'll realistically have in twelve months and what your query volume looks like at target scale. Under ten million vectors, pgvector and Chroma (an open-source vector store that can run embedded in your application) win on raw latency because queries stay on localhost with no network hop. Between ten and fifty million, choose Qdrant or Weaviate depending on whether filtered search or hybrid search matters more. Above fifty million, you need a purpose-built engine with sharding capabilities, and you need to have budgeted for it.
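
The rule above is simple enough to write down as a lookup. The sketch below is only a restatement of the thresholds in this post; the function name and its parameters are made up for illustration, and a real shortlist would also weigh team skills, hosting constraints, and budget.

```python
def shortlist_vector_db(expected_vectors: int, hybrid_search_matters_more: bool) -> str:
    """Map a twelve-month vector estimate to the shortlist described in this post.

    Hypothetical helper: the thresholds (10M and 50M vectors) come from the
    article's rule of thumb, not from any vendor documentation.
    """
    if expected_vectors < 10_000_000:
        return "pgvector or Chroma"
    if expected_vectors <= 50_000_000:
        # Hybrid (keyword + vector) search points to Weaviate,
        # metadata-filtered search points to Qdrant.
        return "Weaviate" if hybrid_search_matters_more else "Qdrant"
    return "purpose-built engine with sharding (budget for it)"


if __name__ == "__main__":
    for estimate in (2_000_000, 30_000_000, 120_000_000):
        choice = shortlist_vector_db(estimate, hybrid_search_matters_more=False)
        print(f"{estimate:>11,} vectors -> {choice}")
```

The useful part is less the code than the input it forces: someone has to commit to a twelve-month vector estimate before the conversation about tools starts.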

The cost question to ask before committing: "Show me what this database costs at 10x our current vector count and 10x our current query volume." If nobody has run that model, run it before you sign anything.
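
Running that model does not need a spreadsheet team. Here is a minimal sketch, assuming a usage-billed database that charges for storage plus read units, and assuming read units per query grow with index size. The price sheet, field names, and every number are placeholders, not any vendor's actual pricing; swap in the figures from your own contract.

```python
from dataclasses import dataclass


@dataclass
class VectorDbPricing:
    """Hypothetical price sheet; all values are illustrative placeholders."""
    storage_usd_per_gb_month: float
    read_usd_per_million_units: float
    # Assumption: each query consumes read units in proportion to index size.
    read_units_per_query_per_gb: float


def monthly_cost(pricing: VectorDbPricing, stored_gb: float, queries_per_month: float) -> float:
    """Estimate one month's bill as storage cost plus read cost."""
    storage = stored_gb * pricing.storage_usd_per_gb_month
    read_units = queries_per_month * pricing.read_units_per_query_per_gb * stored_gb
    reads = read_units / 1_000_000 * pricing.read_usd_per_million_units
    return storage + reads


pricing = VectorDbPricing(
    storage_usd_per_gb_month=0.25,
    read_usd_per_million_units=20.0,
    read_units_per_query_per_gb=1.0,
)
today = monthly_cost(pricing, stored_gb=10, queries_per_month=1_000_000)
at_10x = monthly_cost(pricing, stored_gb=100, queries_per_month=10_000_000)

if __name__ == "__main__":
    print(f"today:  ${today:,.2f} per month")
    print(f"at 10x: ${at_10x:,.2f} per month ({at_10x / today:.0f}x the bill)")
```

Under these assumptions, 10x the data and 10x the queries multiply the read bill by roughly 100, because the two growth factors multiply each other. Whether your vendor's billing behaves this way is exactly what the question is meant to surface.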


Orchestration: where complexity compounds

Orchestration frameworks handle the connective tissue between your model and everything else: document loaders, vector stores, APIs, memory, multi-step workflows. They determine how a user's question becomes a retrieved context becomes a prompt becomes a response.

In 2026, teams rarely deploy raw models. They use frameworks to manage chains, memory, retrieval pipelines, and evaluation. The two dominant options are LangChain/LangGraph and LlamaIndex, and understanding their different strengths is more useful than picking one as "better."

LangChain is the most widely adopted LLM framework with over 100,000 GitHub stars and 34.5 million monthly downloads. Its strength is ecosystem breadth: integrations for nearly every vector store, document loader, and model provider. LangGraph, LangChain's agent orchestration layer, stabilized at version 1.0 in October 2025 and is now the recommended path for building stateful, multi-step agent workflows. LangGraph's built-in state persistence is particularly valuable for agents that need to pause for human approval, maintain conversation memory across sessions, or resume long-running tasks after interruption.

LlamaIndex is a data framework built specifically to make retrieval-augmented generation simpler to ship and operate. Its retrieval primitives (hierarchical chunking, auto-merging, sub-question decomposition) produce better retrieval quality with less tuning than LangChain's component-based approach for document-heavy applications. It has approximately six milliseconds of framework overhead versus LangGraph's fourteen milliseconds, and lower token usage per call. These are differences that compound at high query volume.

The PM-level insight: many production teams use LlamaIndex for ingestion and indexing while layering LangChain or LangGraph on top for orchestration. This combination is often the fastest route to a robust RAG system: LlamaIndex handles the data layer, and LangGraph handles the workflow layer. The question isn't which framework is better. It's which layer of your product has the highest stakes for quality, and whether your team has the expertise to operate both.

The red flag to watch for: stacking three or more frameworks usually signals over-engineering. Each framework adds dependency risk, upgrade friction, and cognitive overhead. If your engineers are evaluating a fourth framework for a problem the first three could solve, the answer is almost always to go deeper on what you have, not broader.

The alternative worth knowing: direct API calls without a framework are underrated for simple use cases. A single-step RAG pipeline (retrieve, format prompt, call model, return response) often doesn't need a framework at all. Frameworks earn their place when you have multi-step workflows, complex state management, or enough integrations that the abstractions save more time than they cost. For a proof-of-concept or an early MVP, they can add more complexity than value.
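
To make the "no framework" option concrete, here is a sketch of that single-step pipeline in plain Python. The retriever and the model call are passed in as functions; `toy_retrieve` and `fake_model` are stand-ins invented for this example so it runs offline. In a real product they would wrap your vector database query and your model provider's SDK.

```python
from typing import Callable, List

# Hypothetical in-memory corpus, standing in for a vector database.
DOCS = [
    "Refunds are issued within 14 days.",
    "Support is open on weekdays.",
    "Shipping takes 3 to 5 business days.",
]


def toy_retrieve(question: str, top_k: int) -> List[str]:
    """Stand-in retriever: rank documents by shared words, not embeddings."""
    q_words = set(question.lower().split())
    ranked = sorted(
        DOCS,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def fake_model(prompt: str) -> str:
    """Stand-in for a model API call so the example needs no network access."""
    return f"(model reply to a {len(prompt)}-character prompt)"


def build_prompt(question: str, chunks: List[str]) -> str:
    """Format retrieved chunks and the question into one prompt string."""
    context = "\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def answer(
    question: str,
    retrieve: Callable[[str, int], List[str]],
    call_model: Callable[[str], str],
    top_k: int = 3,
) -> str:
    """Retrieve, format prompt, call model, return response."""
    chunks = retrieve(question, top_k)
    prompt = build_prompt(question, chunks)
    return call_model(prompt)


if __name__ == "__main__":
    print(answer("How long do refunds take?", toy_retrieve, fake_model, top_k=1))
```

The whole pipeline is one function of four lines. If your feature looks like this, a framework is solving a problem you don't have yet.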


Edge vs. cloud: a product decision with privacy and latency consequences

Edge deployment means running the model on the user's device. Cloud deployment means running it on your servers. Most teams default to cloud without evaluating whether that's actually right for their product, and the consequences can be significant in both directions.

Cloud is the right default when your models are large, you need frequent updates, your users tolerate 100ms or more of latency, and privacy isn't a differentiating concern. Every frontier model API call is cloud deployment. ChatGPT, Perplexity, and most enterprise AI features are all cloud. The benefits are obvious: access to the most powerful models, no device constraints, easy updates, personalization from aggregate usage data.

Edge is the right answer when any of three conditions apply. First, latency requirements are under fifty milliseconds, a threshold that network round trips make impossible to hit reliably from cloud. Gmail Smart Compose moved to on-device inference in 2019 precisely because suggestions needed to appear in under 100 milliseconds, and network latency made that unreliable. Second, privacy is a genuine product differentiator. Users whose data never leaves their device have a qualitatively different trust relationship with the product. Apple's keyboard predictions, Face ID, and health features all run on-device not because of technical necessity but because privacy is a product value. Third, offline functionality is required. Tesla's Autopilot cannot depend on network connectivity at highway speeds. Everything safety-critical runs locally.

The tradeoffs are real and worth quantifying: edge models must be small enough to run on device (typically under 100MB for mobile, under 500MB for desktop), updates require app releases rather than server-side changes, and you lose the ability to personalize from aggregate user behavior. The quality ceiling is substantially lower than cloud.

The hybrid pattern that increasingly makes sense: edge for the latency-sensitive fast path, cloud for the quality-sensitive heavy lifting. Apple's Siri uses on-device models for wake word detection and simple queries, routing to cloud only for complex requests. Grammarly runs basic grammar checks on-device for speed, reserving cloud for advanced style and tone analysis. The decision framework: map your feature's latency requirement against your users' privacy sensitivity, then check whether an on-device model can meet your quality bar for that feature.
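
One way to read that framework is as a per-feature checklist. The sketch below encodes the three edge conditions from this section and the quality check. The class, the function, and the rule that "needs edge but fails the quality bar" maps to hybrid are my own simplifications, so treat the output as a conversation starter rather than a verdict.

```python
from dataclasses import dataclass


@dataclass
class FeatureProfile:
    """Hypothetical per-feature inputs a PM would fill in."""
    latency_budget_ms: float
    privacy_is_differentiator: bool
    must_work_offline: bool
    on_device_meets_quality_bar: bool


def choose_deployment(feature: FeatureProfile) -> str:
    """Return 'cloud', 'edge', or 'hybrid' for one feature."""
    # Any one of the three conditions from the text pushes toward edge.
    needs_edge = (
        feature.latency_budget_ms < 50
        or feature.privacy_is_differentiator
        or feature.must_work_offline
    )
    if not needs_edge:
        return "cloud"
    if feature.on_device_meets_quality_bar:
        return "edge"
    # Assumption: keep a small on-device fast path and send the
    # quality-sensitive work to the cloud.
    return "hybrid"


if __name__ == "__main__":
    features = {
        "autocomplete": FeatureProfile(40, False, False, True),
        "tone rewrite": FeatureProfile(800, False, False, False),
        "voice assistant": FeatureProfile(45, True, False, False),
    }
    for name, profile in features.items():
        print(f"{name}: {choose_deployment(profile)}")
```

Note the decision is made per feature, not per product, which is why hybrid shows up so often in practice.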


Serverless vs. managed vs. self-hosted

This is the operational model decision: not where computation happens, but how your team relates to the infrastructure running it.

Serverless (AWS Lambda, cloud functions) means you pay per invocation with no always-on infrastructure. It's the right choice for bursty or unpredictable workloads, batch processing jobs, and features where usage patterns are hard to forecast. At low volume it is genuinely cheap. At high volume the cost can be surprising, and cold start latency (100–1000ms when a function spins up) makes it a poor fit for real-time user-facing features.

Managed (AWS Bedrock, Azure OpenAI, Google Vertex AI) means a cloud provider handles the infrastructure and you consume it as a service. This is the right default for most production AI applications; you get predictable performance, enterprise compliance features, SLA guarantees, and no infrastructure operations work. The tradeoff is a cost premium and some constraints on customization.

Self-hosted means running models on your own compute: EC2 instances, Kubernetes clusters, bare metal. This is justified in three situations: privacy or data residency requirements that prohibit sending inputs to external providers; query volumes high enough that the economics favor ownership over rental; or fine-tuning control that managed providers don't offer. The operational overhead is substantial and often underestimated. Model serving, scaling, failover, monitoring, and updates all become your team's responsibility.

The PM decision rule: start managed and migrate to self-hosted only when you can quantify the cost savings or compliance requirements that justify taking on the operational complexity. The teams that self-host too early spend engineering cycles on infrastructure that could have gone into product. The teams that self-host too late face a migration under cost pressure, which is always worse than a migration you chose.
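
"Quantify the cost savings" usually comes down to a break-even volume. A rough sketch, assuming managed inference is billed per request and self-hosting has a fixed monthly cost plus a smaller per-request cost. All figures are invented placeholders; the fixed cost should include the engineering time you'd spend operating the stack, not just compute.

```python
from typing import Optional


def break_even_requests_per_month(
    managed_usd_per_request: float,
    self_hosted_fixed_usd_per_month: float,
    self_hosted_usd_per_request: float = 0.0,
) -> Optional[float]:
    """Monthly request volume above which self-hosting becomes cheaper.

    Returns None when self-hosting never wins on cost, i.e. its marginal
    cost per request is not lower than the managed price.
    """
    saving_per_request = managed_usd_per_request - self_hosted_usd_per_request
    if saving_per_request <= 0:
        return None
    return self_hosted_fixed_usd_per_month / saving_per_request


# Placeholder inputs for illustration only.
volume = break_even_requests_per_month(
    managed_usd_per_request=0.002,
    self_hosted_fixed_usd_per_month=12_000.0,
    self_hosted_usd_per_request=0.0005,
)

if __name__ == "__main__":
    print(f"Self-hosting breaks even at about {volume:,.0f} requests per month")
```

If your forecast sits well below the break-even volume and no compliance requirement forces the move, the rule above says stay managed.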


The observability gap

Most teams instrument the application layer thoroughly and the AI layer almost not at all. This is the gap that turns minor model issues into production incidents.

Observability for AI products means three things beyond standard application monitoring: tracing individual inference calls end-to-end (what prompt went in, what came out, how long it took, what it cost), tracking quality metrics over time (are outputs getting better or worse as prompts change), and catching distribution shifts before they degrade user experience.

LangSmith provides tracing natively for LangChain/LangGraph applications. Weights & Biases handles model tracking and experiment logging. For teams not using a framework, the minimum viable observability setup is logging every prompt, every output, the model version, the latency, and the cost per call, and then building a dashboard that surfaces anomalies in those signals.
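
For the framework-free case, the minimum setup can be a thin wrapper around every model call. The sketch below records the five signals listed above as JSON lines. The wrapper, the record type, and the character-based cost estimate are illustrative assumptions: providers typically bill by tokens, so a real version would use the token counts returned by your provider.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Callable, List


@dataclass
class InferenceRecord:
    """One logged model call: the five signals named in the text."""
    model_version: str
    prompt: str
    output: str
    latency_ms: float
    cost_usd: float


def logged_call(
    call_model: Callable[[str], str],
    prompt: str,
    model_version: str,
    usd_per_1k_chars: float,
    sink: List[str],
) -> str:
    """Call the model, append one JSON log line to `sink`, return the output."""
    start = time.perf_counter()
    output = call_model(prompt)
    latency_ms = (time.perf_counter() - start) * 1000
    # Crude placeholder cost model based on characters, not tokens.
    cost_usd = (len(prompt) + len(output)) / 1000 * usd_per_1k_chars
    record = InferenceRecord(model_version, prompt, output, latency_ms, cost_usd)
    sink.append(json.dumps(asdict(record)))
    return output


if __name__ == "__main__":
    log_lines: List[str] = []
    reply = logged_call(
        call_model=lambda p: "Refunds take up to 14 days.",  # stand-in model
        prompt="How long do refunds take?",
        model_version="example-model-v1",
        usd_per_1k_chars=0.01,
        sink=log_lines,
    )
    print(reply)
    print(log_lines[0])
```

In production the sink would be a log pipeline or a table rather than a list, and prompts containing user data need the same retention and access rules as any other user data.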

The PM question to ask before launch: "If the model starts producing worse outputs tomorrow, how long before we notice?" If the honest answer is "when users complain," your observability is insufficient.


What comes next

You now have the full picture of the infrastructure stack: what each layer does, how the decisions compound, and what to ask before your team commits to any of them.

Infrastructure choices define the constraints you operate within. The next layer of PM thinking is about making the best decisions within those constraints, specifically the tradeoffs that appear in every AI product review: how to balance accuracy against latency against cost, when false positives matter more than false negatives, how much automation is actually safe to give users, and when model complexity starts working against you.

In the next post, we'll build the tradeoffs framework that ties everything together: the mental models that let you make principled decisions when perfect options don't exist and every choice costs something.