If you are deciding between agentic RAG and a static retrieve-then-generate pipeline, start static. A fixed pipeline, or a cheap router in front of it, answers most queries at a fraction of the cost and well under a second, while an agent loop multiplies model calls and seconds per query. Graduate to a single agent with tools only when evaluation shows a fixed path failing. Agency is a cost you justify with evidence.
This targets a TypeScript stack: Next.js or Nuxt server routes calling a vector store and a model SDK. Code examples use the Vercel AI SDK 6 as of June 2026, with current Claude model ids claude-opus-4-8, claude-sonnet-4-6, and claude-haiku-4-5-20251001.
TL;DR: which RAG architecture should you pick?
Default to a static retrieve-then-generate pipeline, because it is one fixed set of model calls and it answers the majority of product questions in well under a second. When a minority of queries need a different path, add a cheap router: one small classification call that decides where the query goes, with no open-ended loop. Reach for a single agent with tools only when component-level evaluation proves a fixed path is failing on a real slice of traffic, such as multi-hop questions or retrieval that misses on the first try. Reserve multi-agent designs for when a single agent demonstrably cannot cope, because every agent you add multiplies the per-query cost and the failure surface. The progression is static, then routing, then one agent, then many, and you only move to the next rung when the current one provably breaks.
How agentic RAG and static RAG actually differ
The word "agentic" hides the only difference that matters for your bill and your p95 latency: who decides what happens next. In a static pipeline, your code decides. The sequence is fixed at build time: embed the query, fetch the top matches, hand them to the model, return the answer. The generation model is called once, and it does one job, which is to write the answer from the context you already chose.
In an agentic pipeline, the model decides. You hand it a set of tools, a search tool, maybe a second index, maybe a calculator, and let it run a reason-act-observe loop. It picks a tool, reads the result, and either calls another tool or writes the final answer. That loop is the whole point and the whole cost. Each turn around it is another model call, and the model, not you, chooses how many turns it takes.
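To make the control-flow difference concrete, here is a minimal sketch of a reason-act-observe loop with the model stubbed out as a plain function. Every name in it (`Decision`, `Decide`, `Tools`, `runLoop`) is hypothetical and none of it is the AI SDK's API; it only illustrates the shape of loop that an SDK runs on your behalf.

```typescript
// Hypothetical sketch: the "model" is a plain function so the loop is visible.
type Decision =
  | { kind: 'tool'; name: string; input: string }
  | { kind: 'answer'; text: string }

// Stands in for one model call: reads the transcript, picks the next action.
type Decide = (transcript: string[]) => Decision

type Tools = Record<string, (input: string) => string>

function runLoop(question: string, decide: Decide, tools: Tools, maxSteps: number) {
  const transcript = [`question: ${question}`]
  let modelCalls = 0

  while (modelCalls < maxSteps) {
    const decision = decide(transcript) // reason
    modelCalls += 1

    if (decision.kind === 'answer') {
      return { answer: decision.text, modelCalls }
    }

    // act: run the tool the model picked
    const run = tools[decision.name]
    const observation = run ? run(decision.input) : `unknown tool: ${decision.name}`

    // observe: the result is appended for the next model call to read
    transcript.push(`${decision.name}(${decision.input}) -> ${observation}`)
  }

  // The step cap fired before the model committed to an answer.
  return { answer: null, modelCalls }
}
```

The `modelCalls` value in the result is the number the model ends up choosing, bounded only by `maxSteps`. In a static pipeline that number is a constant you wrote down at build time.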
Here is the comparison that decides most projects. Treat the numbers as reported ranges from third-party write-ups rather than guarantees, then measure your own.
| Pattern | Typical latency | Relative cost per query | Model calls per query | Use it when |
|---|---|---|---|---|
| Static pipeline | sub-second | baseline (1x) | 1 generation + 1 embedding | Most factual lookups over your own docs |
| Router + static | sub-second plus a small hop | ~1.1x to 1.5x | + 1 cheap classify call | A minority of queries need a different path or no retrieval |
| Single agent with tools | a few seconds | several x | 1 per step, often 3 to 6 | Multi-hop questions or retrieval that fails on the first pass |
| Multi-agent | several to many seconds | highest | many, each agent loops | A single agent provably cannot decompose the task |
Reported figures from 2026 RAG write-ups put naive pipelines in the low hundreds of milliseconds and agentic loops in the multi-second range at several times the cost, and some retrieval studies report that a single naive retrieval pass misses the needed context on a meaningful share of harder queries. The exact multipliers depend on your model, your chunking, and your traffic, so the table is a shape to reason with, not a benchmark to quote. The pattern holds regardless of the numbers: agency buys flexibility and charges you in calls.
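If you want to turn the shape of that table into a number for your own traffic, a back-of-envelope model is enough. The sketch below is illustrative only: the prices, token counts, and call counts are placeholder assumptions, not vendor pricing or measurements, and `CallProfile`, `Price`, and `costPerQuery` are hypothetical names.

```typescript
// Back-of-envelope cost model. Every number below is a placeholder
// assumption for illustration, not real pricing or a measurement.
interface CallProfile {
  calls: number        // model calls per query
  inputTokens: number  // average input tokens per call
  outputTokens: number // average output tokens per call
}

interface Price {
  inputPerMTok: number  // currency units per million input tokens
  outputPerMTok: number // currency units per million output tokens
}

function costPerQuery(profile: CallProfile, price: Price): number {
  const perCall =
    (profile.inputTokens * price.inputPerMTok + profile.outputTokens * price.outputPerMTok) /
    1_000_000
  return profile.calls * perCall
}

const price: Price = { inputPerMTok: 3, outputPerMTok: 15 } // assumed

// One generation call over a few retrieved chunks.
const staticRag: CallProfile = { calls: 1, inputTokens: 2_000, outputTokens: 300 }

// Four steps, with a larger average input because the transcript grows.
const agentRag: CallProfile = { calls: 4, inputTokens: 3_000, outputTokens: 300 }

const multiplier = costPerQuery(agentRag, price) / costPerQuery(staticRag, price)
```

With these placeholder inputs the agent profile comes out a little over five times the static one. Swap in the token counts from your own logs and your provider's current prices before drawing any conclusion from it.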
Why a static pipeline is the right default
Most teams reach for an agent loop far too early, usually because the tutorials they copied did. The honest starting point for retrieval over your own documents is a pipeline with no agency at all, and it clears the bar for a surprising share of real queries: "what does our refund policy say," "which plan includes SSO," "summarize this ticket thread." None of those need a model to plan. They need the right three chunks and one good generation.
A static pipeline is also the only version you can reason about cleanly. The cost is fixed, the latency is fixed, and when it returns a bad answer you know exactly where to look, because there are only two moving parts: what got retrieved, and what the model did with it. That is worth more than it sounds when you are debugging in production at 2am.
Let's have a look at the whole thing. The Vercel AI SDK ships the retrieval primitives you need, embed for the query vector and cosineSimilarity for ranking, so the pipeline is small.
src/ai/static-rag.ts

```typescript
import { embed, cosineSimilarity, generateText } from 'ai'
import { openai } from '@ai-sdk/openai'
import { anthropic } from '@ai-sdk/anthropic'

// Exported so other modules (the router below) can import the type.
export interface Chunk {
  text: string
  embedding: number[]
}

export async function answerFromDocs(question: string, chunks: Chunk[]): Promise<string> {
  // One embedding call for the query vector.
  const { embedding } = await embed({
    model: openai.embedding('text-embedding-3-small'),
    value: question,
  })

  // Rank every chunk by cosine similarity and keep the top three.
  const context = chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(embedding, chunk.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, 3)
    .map((match) => match.chunk.text)
    .join('\n\n')

  // One generation call, pinned to the retrieved context.
  const { text } = await generateText({
    model: anthropic('claude-sonnet-4-6'),
    system: 'Answer only from the provided context. If the answer is not in it, say you do not know.',
    prompt: `Context:\n${context}\n\nQuestion: ${question}`,
    maxOutputTokens: 512,
  })

  return text
}
```
The shape of this function is the argument. There is no loop, no tool, no decision point the model controls, so the path is the same for every query and so is the price: one embedding call and one generation. The system instruction that pins the model to the supplied context is doing the quiet work of keeping the model from answering from training data, which is the failure that makes a "working" demo lie to users. In a real app the chunks come from a vector store query rather than an in-memory array, but the control flow is exactly this: retrieve, then generate, and stop.
What this pipeline cannot do is notice that it retrieved the wrong thing. If the question needs two facts that live in two documents, the single retrieval pass grabs whichever is closer in vector space and the model answers from half the picture. Hold that limitation. It is the one that eventually justifies more machinery, and recognizing it in your own traffic is the trigger, not a blog post telling you agents are the future.
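One way to recognize that limitation in your own traffic is a component-level check on retrieval alone, before any generation happens. The sketch below assumes you have hand-labeled which chunks each test question needs and that your chunks carry stable ids; both are assumptions, and `RetrievalCase`, `isFullyCovered`, `coverageRate`, and `multiHopMisses` are hypothetical names.

```typescript
// Hypothetical component-level check: did the retrieval pass return every
// chunk the question needs? Chunk ids and the case format are assumptions.
interface RetrievalCase {
  question: string
  neededChunkIds: string[]    // labeled by hand
  retrievedChunkIds: string[] // what the top-k pass actually returned
}

function isFullyCovered(c: RetrievalCase): boolean {
  const retrieved = new Set(c.retrievedChunkIds)
  return c.neededChunkIds.every((id) => retrieved.has(id))
}

// Share of cases where retrieval returned everything the answer needs.
function coverageRate(cases: RetrievalCase[]): number {
  if (cases.length === 0) return 0
  return cases.filter(isFullyCovered).length / cases.length
}

// Cases that need more than one chunk and missed at least one of them.
// These are candidates for the multi-hop failure described above.
function multiHopMisses(cases: RetrievalCase[]): RetrievalCase[] {
  return cases.filter((c) => c.neededChunkIds.length > 1 && !isFullyCovered(c))
}
```

A low coverage rate concentrated in `multiHopMisses` points at the two-documents problem. A low rate on single-chunk questions points at chunking or embeddings instead, which an agent loop would not fix.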
When does query routing capture most of the value?
Between the static pipeline and a full agent sits the option most teams skip, and it is usually the one they actually need. A router does not give the model an open-ended loop. It asks the model one narrow question first, what kind of query is this, and uses that single answer to pick a fixed path. The model gets a vote on routing, not on the whole control flow.
The case for routing is that real traffic is not uniform. A chunk of it needs no retrieval at all (a greeting, a "thanks, that worked"), another chunk belongs to a different index (billing questions versus product-docs questions), and the rest is the ordinary case your static pipeline already handles. Running full retrieval on a greeting is waste, and routing a billing question into the product-docs index is a wrong answer waiting to happen. One cheap classification call fixes both.
src/ai/router.ts

```typescript
import { generateText } from 'ai'
import { anthropic } from '@ai-sdk/anthropic'
import { answerFromDocs, type Chunk } from './static-rag'

type Route = 'product_docs' | 'billing' | 'no_retrieval'

async function classify(question: string): Promise<Route> {
  // One small call on the cheapest model; the output is a single label.
  const { text } = await generateText({
    model: anthropic('claude-haiku-4-5-20251001'),
    system:
      'Classify the user question into exactly one label: product_docs, billing, or no_retrieval. Reply with the label only.',
    prompt: question,
    maxOutputTokens: 8,
  })

  const label = text.trim() as Route
  // Anything unrecognized falls back to the most common route.
  return label === 'billing' || label === 'no_retrieval' ? label : 'product_docs'
}

export async function routeAndAnswer(
  question: string,
  indexes: { product_docs: Chunk[]; billing: Chunk[] },
): Promise<string> {
  const route = await classify(question)

  // Greetings and acknowledgements skip retrieval entirely.
  if (route === 'no_retrieval') {
    return 'Happy to help. Ask a question about the product or your account.'
  }

  return answerFromDocs(question, indexes[route])
}
```
Two details carry the design. The classifier runs on the cheapest model in your fleet, Haiku here, with maxOutputTokens clamped to a handful so the call is fast and nearly free, because its only job is to emit one label. And the fallback in the last line of classify matters more than the happy path: a router that throws or returns garbage on an unexpected label is worse than no router, so an unrecognized response defaults to the safe, most-common route rather than failing the request. Routing buys you most of the adaptivity people want from an agent, for the price of one small call and zero loops. For many products this is the last rung you ever need.
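The classify function above accepts only an exact, trimmed match, so a reply like "Billing" or "billing." lands on the default route. If that shows up in your logs, a slightly more tolerant parser is one option. This is a sketch, not part of the SDK: `ROUTES` and `parseRoute` are hypothetical names, and the normalization rules are an assumption about what sloppy labels look like.

```typescript
type Route = 'product_docs' | 'billing' | 'no_retrieval'

const ROUTES: readonly Route[] = ['product_docs', 'billing', 'no_retrieval']

// Normalizes the raw model output, then falls back to the most common
// route when the cleaned label is still not one of the known routes.
function parseRoute(raw: string, fallback: Route = 'product_docs'): Route {
  const cleaned = raw
    .trim()
    .toLowerCase()
    .replace(/[\s-]+/g, '_')  // "no retrieval" or "no-retrieval" -> "no_retrieval"
    .replace(/[^a-z_]/g, '')  // drop quotes, periods, and other stray characters
  const match = ROUTES.find((route) => route === cleaned)
  return match ?? fallback
}
```

The parser still refuses to guess: a full sentence that merely mentions billing does not match any label and takes the fallback, which keeps the safe default as the only behavior for unexpected output.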
When does a single agent with tools earn its keep?
The honest trigger for an agent is not a feature request. It is an eval that fails. When you can point at a slice of real queries where the static pipeline returns wrong or incomplete answers, and you can see the cause is retrieval, the first pass grabbed the wrong chunks, or the question needs two hops the pipeline cannot make, then a controller that can retrieve, look at what came back, and retrieve again starts to pay for itself. The classic shape is multi-hop: "which of our enterprise customers signed up before the SSO feature shipped" needs the SSO ship date and then a second lookup filtered by it. One retrieval cannot express that. A loop can.
The danger is that an agent loop is also the easiest thing to ship without guardrails, and an unguarded loop is a budget leak and a latency cliff. So the version worth showing is the guarded one. In the Vercel AI SDK, generateText runs a multi-step tool loop when you give it tools and a stop condition via stopWhen, and you compose stop conditions in an array so the loop ends on whichever fires first.
src/ai/agent-rag.ts

```typescript
import { generateText, tool, stepCountIs, hasToolCall } from 'ai'
import { anthropic } from '@ai-sdk/anthropic'
import { z } from 'zod'
import { searchIndex } from './search'

export async function agentAnswer(question: string): Promise<string> {
  // Wall-clock backstop: abort the whole run after 15 seconds.
  const controller = new AbortController()
  const timeout = setTimeout(() => controller.abort(), 15_000)

  try {
    const result = await generateText({
      model: anthropic('claude-sonnet-4-6'),
      system:
        'Use the search tool to gather context before answering. Search at most a few times, then call submitAnswer with your final answer grounded in what you found.',
      prompt: question,
      tools: {
        search: tool({
          description: 'Search the product documentation for relevant passages.',
          inputSchema: z.object({ query: z.string().describe('A focused search query') }),
          execute: async ({ query }) => searchIndex(query),
        }),
        submitAnswer: tool({
          description: 'Submit the final answer once enough context has been gathered.',
          inputSchema: z.object({ answer: z.string() }),
          execute: async ({ answer }) => answer,
        }),
      },
      // Stop on whichever fires first: the step cap or the committed answer.
      stopWhen: [stepCountIs(5), hasToolCall('submitAnswer')],
      maxOutputTokens: 1024,
      abortSignal: controller.signal,
    })

    // The loop usually ends on a tool call, so read the answer from the tool results.
    const submitted = result.steps
      .flatMap((step) => step.toolResults)
      .find((entry) => entry.toolName === 'submitAnswer')

    return typeof submitted?.output === 'string' ? submitted.output : result.text
  } finally {
    clearTimeout(timeout)
  }
}
```
Every line that is not the model call is a guardrail, and that ratio is the point. The stopWhen array caps the loop two ways at once: stepCountIs(5) stops a runaway after five steps no matter what, and hasToolCall('submitAnswer') stops it the instant the agent commits to an answer, so a well-behaved query exits early instead of burning its full budget. The AbortController with a 15-second timeout fed to abortSignal is the wall-clock backstop for a step that hangs, because a step cap does nothing if one tool call never returns. And reading the answer back out of result.steps rather than trusting result.text reflects how the loop actually ends: when the agent finishes by calling a tool, the final text may be empty, so you pull the submitted answer from the tool results and fall back to the text only if it is missing.
The SDK also offers a higher-level ToolLoopAgent class in version 6 that wraps this same loop, which is worth adopting once you have several agents that share a model and toolset. I am showing the explicit generateText form here on purpose, because when you are deciding whether you even need an agent, you want every guardrail visible in the code, not folded into a class. The cost is real: this single query can be three to six model calls instead of one, so it has to earn that by answering questions the static pipeline measurably cannot.
What guardrails does every agent loop need?
An agent loop fails differently from a static pipeline, and the failures are expensive: it loops too long, it spends too much, or it hangs on a tool that never returns. None of those can happen to a fixed pipeline, because a fixed pipeline has no loop to run away. Before any agent touches a paying user, it needs the same controls you would put on any process that calls a flaky dependency in a loop.
- Cap the steps. Pass stopWhen a hard ceiling like stepCountIs(5) so a confused model cannot loop indefinitely, and pair it with hasToolCall('submitAnswer') so a confident model can finish early. The cap protects your budget; the early exit protects your latency.
- Set a wall-clock timeout. A step cap does nothing if one tool call hangs, so wrap the run in an AbortController and pass its signal to abortSignal. Decide the deadline from your p95 budget, not from a round number.
- Fall back to the fixed path. When the agent aborts, errors, or exhausts its steps without an answer, degrade to the static pipeline instead of returning an error. The agent is an optimization on top of a working baseline, and the baseline is your safety net.
- Bound retries. The SDK retries failed calls (maxRetries defaults to 2), which interacts with your loop and your timeout, so set it deliberately rather than letting retries stack on top of steps and inflate both cost and latency.
- Log the trail. Record the steps, the tool calls, and the token usage from result.steps and result.totalUsage on every request, because an agent that silently took six steps when it should have taken two is a cost regression you can only catch if you measured it.
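For the logging item, a small summary function is usually enough to start with. The sketch below works on a minimal structural type rather than the SDK's own result type: `RunLike` assumes the result exposes `steps[].toolCalls[].toolName` and `totalUsage.totalTokens`, which you should verify against the SDK version you run. `RunSummary` and `summarizeRun` are hypothetical names.

```typescript
// Minimal structural view of a finished run. The field names are assumed
// to mirror the SDK result (steps, toolCalls, totalUsage); verify them.
interface RunLike {
  steps: { toolCalls: { toolName: string }[] }[]
  totalUsage: { totalTokens?: number }
}

interface RunSummary {
  stepCount: number
  toolCallCounts: Record<string, number>
  totalTokens: number
  overBudget: boolean
}

// Reduces one run to the few numbers worth logging on every request.
function summarizeRun(run: RunLike, expectedMaxSteps: number): RunSummary {
  const toolCallCounts: Record<string, number> = {}
  for (const step of run.steps) {
    for (const call of step.toolCalls) {
      toolCallCounts[call.toolName] = (toolCallCounts[call.toolName] ?? 0) + 1
    }
  }
  return {
    stepCount: run.steps.length,
    toolCallCounts,
    totalTokens: run.totalUsage.totalTokens ?? 0,
    // Flags runs that took more steps than you expect for this query type.
    overBudget: run.steps.length > expectedMaxSteps,
  }
}
```

Emitting the summary as one structured log line per request makes the "six steps instead of two" regression a query over your logs rather than a surprise on the invoice.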
The instinct here is the same one behind building an LLM fallback layer before your model vanishes: treat the model and its loop as an unreliable dependency and put the controls in your own code, where you can reason about them. An agent without these guardrails is not a smarter pipeline. It is an unbounded one.
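Putting the fallback in your own code can be as small as one wrapper. The sketch below is hypothetical: `Answerer`, `StaticAnswerer`, and `answerWithFallback` are illustrative names, and it assumes the agent function accepts an AbortSignal and honors it, which the agentAnswer shown earlier would need a small change to do.

```typescript
// Hypothetical wrapper: try the agent under a deadline, then degrade to the
// static path on abort, error, or an empty answer.
type Answerer = (question: string, signal: AbortSignal) => Promise<string>
type StaticAnswerer = (question: string) => Promise<string>

async function answerWithFallback(
  question: string,
  agent: Answerer,
  staticPath: StaticAnswerer,
  deadlineMs: number,
): Promise<{ answer: string; path: 'agent' | 'static' }> {
  const controller = new AbortController()
  const timer = setTimeout(() => controller.abort(), deadlineMs)

  try {
    const answer = await agent(question, controller.signal)
    // An empty answer usually means the step cap fired before a final answer.
    if (answer.trim().length > 0) return { answer, path: 'agent' }
  } catch {
    // Abort or tool failure: fall through to the baseline.
  } finally {
    clearTimeout(timer)
  }

  return { answer: await staticPath(question), path: 'static' }
}
```

Returning the path alongside the answer lets you log how often the agent actually finishes. Note that the deadline only takes effect if the agent rejects when the signal aborts; an agent that ignores the signal would need a race against a timer instead.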
A decision checklist you can apply this week
You do not need to choose the architecture upfront. You need to choose the next rung, ship it, and let evidence pull you up the ladder. Run this against the RAG feature you are building right now.
- Start with the static pipeline. Embed, retrieve, generate, stop. Ship it and watch real queries before you add anything.
- Build a small eval set from real or realistic questions, each paired with the answer you expect. Without it, every "the agent is better" claim is a vibe, the same kind of vibe-driven shortcut the AI code review checklist for React and Vue is built to catch.
- Add routing when, and only when, you see distinct query types that need different paths or no retrieval. One cheap classify call, with a safe default on an unexpected label.
- Graduate to a single agent only when the eval shows a fixed path failing on a real slice, and the cause is retrieval that needs more than one pass. The failing eval is the permission slip.
- When you do add an agent, add the guardrails in the same commit: step cap, tool-call stop, timeout, fallback to static. Treat the loop as a runtime dependency and test the agent path the way you test any AI-generated code, against the eval, not by eyeballing one happy-path answer.
- Hold multi-agent in reserve. Only split into multiple agents when a single agent provably cannot decompose the task, because you are multiplying cost and failure modes to buy it.
The thread through all of it: each rung adds a model call and a way to fail, so each rung needs a reason that shows up in your evals, not in a benchmark someone else ran.
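The eval in that checklist does not need a framework to get started. Here is a minimal sketch that reports a pass rate per slice for any candidate pipeline. The case format, the slice labels, and the substring-based pass criterion are all assumptions chosen for brevity; `EvalCase`, `Pipeline`, and `passRateBySlice` are hypothetical names.

```typescript
// Hypothetical eval harness: pass rate per slice for one candidate pipeline.
interface EvalCase {
  question: string
  mustInclude: string[] // substrings the answer is expected to contain
  slice: string         // for example 'lookup' or 'multi_hop'
}

type Pipeline = (question: string) => Promise<string>

async function passRateBySlice(
  cases: EvalCase[],
  pipeline: Pipeline,
): Promise<Record<string, number>> {
  const totals: Record<string, { passed: number; total: number }> = {}

  for (const c of cases) {
    const answer = (await pipeline(c.question)).toLowerCase()
    // Crude criterion: every expected substring appears in the answer.
    const passed = c.mustInclude.every((s) => answer.includes(s.toLowerCase()))

    const bucket = (totals[c.slice] ??= { passed: 0, total: 0 })
    bucket.total += 1
    if (passed) bucket.passed += 1
  }

  const rates: Record<string, number> = {}
  for (const [slice, { passed, total }] of Object.entries(totals)) {
    rates[slice] = passed / total
  }
  return rates
}
```

Run it once per rung with the same cases. If the static pipeline scores well on the lookup slice and poorly on the multi-hop slice, that gap is the evidence for an agent; if both slices look the same, the extra model calls are not buying anything. Substring matching is a blunt instrument, so treat it as a starting point rather than a grader.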
Conclusion
The trap is treating agency as a feature you turn on, when it is a cost you take on. A static pipeline is not the beginner version of RAG that you graduate out of. For a large share of real product questions it is the correct, finished answer, and the teams that ship it and measure it before reaching for a loop are not behind; they are spending their latency and their token budget where it actually buys something. Routing earns its place when traffic splits into paths. A single agent earns its place when a fixed path provably fails. Many agents earn their place almost never.
So build the smallest thing that answers your queries, instrument it so the next rung announces itself, and let the failing eval, not the hype cycle, be the thing that tells you it is time to add agency. The architecture should follow the evidence. It should never lead it.


