Do I need an agent framework or only the Vercel AI SDK?

If your agent is a single tool-calling loop that finishes inside one request, the Vercel AI SDK is enough and a framework is premature. Reach for a framework only when you keep rebuilding the same hard plumbing: durable execution that survives a crash, sandboxed code execution, human approvals, multi-channel or scheduled intake, or evaluation you can run on every change. The deciding question is not whether you have a real agent, but which of those production problems you actually have today.

What problems does an agent framework actually solve?

A framework earns its keep by owning the plumbing around the model call rather than the call itself. The recurring problems are durable execution and resumability, sandboxed code execution, human-in-the-loop approvals, multi-channel and long-running intake, and evaluation plus tracing as a standing need. Eve and Flue both ship these as primitives so you do not hand-build a persistence layer, a sandbox, and an approval queue per project.

Is it bad to build an agent loop from scratch?

No. A hand-rolled loop is a while loop around an LLM call and a tool dispatcher, and for a self-contained task it is the most honest option you can ship. You own retries, persistence, and isolation, which is fine until you need all three at once. The cost is not the loop; it is everything around the loop once the agent has to survive crashes, run untrusted code, or pause for a human.

When should I migrate from the AI SDK to a framework?

Migrate when you hit a concrete plumbing wall, not on a launch announcement. The triggers are specific: a run must resume mid-task after a deploy, the agent must execute code you cannot trust in your own process, a tool call must pause for human sign-off, the agent is triggered by Slack or GitHub instead of one request, or you need to measure quality over time. One wall is a strong signal. Two or more is a clear yes.

What is durable agent execution and why does it matter?

Durable execution means a run can survive a crash, a timeout, or a deploy and resume from where it stopped rather than starting over. It matters because a long agent run that restarts from zero repeats every tool call, doubles the token bill, and can take side effects twice. The Vercel AI SDK keeps state in memory for one request, so durability is exactly the kind of plumbing a framework like Eve or Flue provides and you would otherwise build by hand.

Do You Need an Agent Framework Yet? | The Road To Enterprise

You need an agent framework only when you hit a specific plumbing wall: a run that must survive a crash, code that must run in a sandbox, a tool call that must wait for a human, an agent triggered by Slack instead of a request, or quality you have to measure over time. If your agent is a single tool-calling loop that finishes inside one request, the Vercel AI SDK is the right amount of tool, and a framework is premature abstraction. This is a build-versus-buy decision, and the honest default for most teams is to stay on the AI SDK longer than the launch hype suggests.

This guide targets the Vercel AI SDK 6 (the current stable major, ai@6.x, as of June 2026), with @ai-sdk/anthropic 3.x as the provider. The two frameworks it weighs are Eve (eve@0.11.x, public preview) and Flue (@flue/runtime@1.0.0-beta.1). All three are Apache-2.0.

TL;DR: do you need an agent framework?

Stay on the Vercel AI SDK until you hit a plumbing wall that the SDK does not own. The SDK gives you the tool-calling loop, structured output, streaming, and provider switching, which covers a single agent that does its work inside one request. A framework like Eve or Flue is worth its complexity once you keep rebuilding the same five hard pieces around that loop: durable execution that resumes after a crash, sandboxed code execution, human-in-the-loop approvals, multi-channel or scheduled intake, and evaluation plus tracing you keep running over time. The decision is not philosophical. Count how many of those five you have in production today. Zero means keep your loop. One is a strong signal to adopt deliberately. Two or more means the framework is already cheaper than the plumbing you are about to write by hand.

The three tiers: hand-rolled, the AI SDK, a full framework

Most "do I need a framework" debates skip a tier and turn into a false binary: raw SDK calls versus a heavyweight framework. There are three rungs, and the middle one is where most production agents should live for longer than people expect.

The first rung is the hand-rolled loop. It is a while loop around a model call and a tool dispatcher: you call the model, it asks for a tool, you run the tool, you feed the result back, and you repeat until the model stops asking. Zero dependencies, total control, and you own everything, including retries, persistence, and isolation. For a self-contained task that runs once and returns, this is not a hack. It is the most transparent thing you can ship, and it is worth writing once to understand what the higher rungs are doing for you.
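As a rough illustration of that first rung, here is a minimal sketch of the loop with no dependencies. The names ModelTurn, CallModel, Tools, and runLoop are assumptions made up for this example, and callModel stands in for whatever provider client you actually use.

```typescript
// A model turn either asks for a tool or gives a final answer.
type ModelTurn =
  | { type: 'tool-call'; toolName: string; input: unknown }
  | { type: 'final'; text: string }

type Message = { role: 'user' | 'assistant' | 'tool'; content: string }

// Hypothetical signatures: swap in your provider client and your real tools.
type CallModel = (messages: Message[]) => Promise<ModelTurn>
type Tools = Record<string, (input: unknown) => Promise<unknown>>

async function runLoop(
  callModel: CallModel,
  tools: Tools,
  question: string,
  maxSteps = 5,
): Promise<string> {
  const messages: Message[] = [{ role: 'user', content: question }]

  for (let step = 0; step < maxSteps; step++) {
    const turn = await callModel(messages)
    if (turn.type === 'final') return turn.text

    // Dispatch the requested tool, or report the problem back to the model.
    const tool = tools[turn.toolName]
    const result = tool
      ? await tool(turn.input)
      : { error: `Unknown tool: ${turn.toolName}` }

    // Feed the call and its result back so the next turn can use them.
    messages.push({ role: 'assistant', content: JSON.stringify(turn) })
    messages.push({ role: 'tool', content: JSON.stringify(result) })
  }

  throw new Error(`No final answer after ${maxSteps} steps`)
}
```

Everything the higher rungs add is missing here on purpose: there are no retries, nothing is persisted, and the tools run in your own process.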

The second rung is the Vercel AI SDK. It gives you the same loop without you maintaining the control flow, plus a typed tool definition, streaming, structured output, and a single call shape across providers. Here is the lower-level form of the loop.

src/agents/support.ts

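import { generateText, stepCountIs, tool } from 'ai'
import { anthropic } from '@ai-sdk/anthropic'
import { z } from 'zod'

// Placeholder so the example runs on its own; replace with your real order lookup.
async function getOrderStatus(orderId: string) {
  return { orderId, status: 'shipped' }
}

// A typed tool: the model sees the description and the input schema.
const lookupOrder = tool({
  description: 'Look up an order by its ID and return its status.',
  inputSchema: z.object({ orderId: z.string() }),
  execute: async ({ orderId }) => getOrderStatus(orderId),
})

export async function answerSupportQuery(question: string) {
  const { text } = await generateText({
    model: anthropic('claude-sonnet-4-6'),
    tools: { lookupOrder },
    // Loop through tool calls until a final answer or five steps.
    stopWhen: stepCountIs(5),
    prompt: question,
  })
  return text
}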

The thing that turns a single completion into an agent is stopWhen. Without it, generateText runs one step and returns; with stopWhen: stepCountIs(5), the SDK runs the model, executes any tool it asks for, feeds the result back, and loops until the model produces a final answer or hits five steps. Per the AI SDK loop-control docs, stopWhen also accepts an array, so you can stop on the first condition that matches, for example stopWhen: [stepCountIs(10), hasToolCall('submitAnswer')]. That is the entire agent loop, and you did not write the control flow.
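To make the "first condition that matches" behavior concrete, here is a small dependency-free model of how a list of stop conditions can be evaluated. This is not the SDK's implementation: Step, StopCondition, maxSteps, calledTool, and shouldStop are names invented for the sketch, and the assumption that a tool-call condition looks only at the most recent step is mine.

```typescript
// Simplified view of one loop step: just the names of the tools it called.
type Step = { toolCalls: string[] }
type StopCondition = (steps: Step[]) => boolean

// Stop once the run has used up its step budget.
const maxSteps = (limit: number): StopCondition => (steps) => steps.length >= limit

// Stop once the most recent step called the named tool.
const calledTool = (name: string): StopCondition => (steps) => {
  const last = steps[steps.length - 1]
  return last !== undefined && last.toolCalls.some((toolName) => toolName === name)
}

// With several conditions, the loop stops as soon as any one of them matches.
function shouldStop(conditions: StopCondition[], steps: Step[]): boolean {
  return conditions.some((condition) => condition(steps))
}

const conditions = [maxSteps(10), calledTool('submitAnswer')]
```

The design point is that the step budget is a ceiling and the tool-call condition is the intended exit, so listing both gives you a normal way out plus a guardrail.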

AI SDK 6 also ships a higher-level object form, the ToolLoopAgent class, for when you want to bind the model, tools, and stop condition once and reuse them.

src/agents/support-agent.ts

import { ToolLoopAgent, stepCountIs, tool } from 'ai'
import { anthropic } from '@ai-sdk/anthropic'
import { z } from 'zod'

// getOrderStatus is your own order-lookup helper, defined elsewhere.
export const supportAgent = new ToolLoopAgent({
  model: anthropic('claude-sonnet-4-6'),
  tools: {
    lookupOrder: tool({
      description: 'Look up an order by its ID and return its status.',
      inputSchema: z.object({ orderId: z.string() }),
      execute: async ({ orderId }) => getOrderStatus(orderId),
    }),
  },
  // Stop after twenty steps if the model has not finished by then.
  stopWhen: stepCountIs(20),
})

// Model, tools, and stop condition are bound once; each call only supplies the prompt.
const result = await supportAgent.generate({ prompt: 'Where is order 4815?' })

ToolLoopAgent runs the same loop as generateText plus stopWhen, with a default of stopWhen: stepCountIs(20) so it stops after twenty steps if nothing else does. One detail matters for how you reason about the SDK: as the AI SDK agent docs put it, in AI SDK 6 Agent is a type, not a class. There is no hidden runtime doing orchestration behind your back. The agent is your tools, your model, and a stop condition, which is exactly why the SDK stays understandable as the middle rung.

The third rung is a full agent framework. Eve and Flue both wrap the model call in a lifecycle: durable sessions, a sandbox, approval gates, channels that receive events from Slack or GitHub, subagents, and evals. You stop writing the loop and start declaring an agent that the framework runs, persists, and resumes for you. That is a real jump in capability and a real jump in surface area, which is the trade the rest of this guide is about.

The five plumbing walls that actually justify a framework

A framework is not justified by your agent being "agentic enough." It is justified by a specific production problem you have already hit and do not want to solve again by hand. There are five, and each one is a wall in the sense that the AI SDK does not climb it for you. If you have not hit any of them, you do not have a framework problem yet.

The pattern across all five is the same: the model call is the easy part, and the framework earns its place by owning everything around the call. That is the same lesson behind hardening an AI-generated React app for production, where the generated component is fine and the production gap is everything the prototype skipped.

Durable execution and resumability

The first wall is the one teams hit hardest and notice latest. A long agent run holds its state in memory: the conversation so far, which tools have run, what they returned. The Vercel AI SDK keeps that state for the duration of one request, which is correct for a request-shaped agent. When the run is long, that becomes the problem.

Imagine an agent that takes four minutes to triage an incident across six tool calls. A deploy ships at minute three, the serverless instance recycles, and the in-memory state is gone. With a hand-rolled loop or a bare SDK call, the run does not resume. It restarts from the first message, which replays every tool call, doubles the token bill, and can fire any side effect a second time, sending the same Slack message or filing a duplicate ticket. The arithmetic is unforgiving: a run that died on step five of six does not cost you one-sixth extra on retry. It costs you the whole run again, because the conversation has to be rebuilt from message one before the model can continue. That is what durable execution prevents: the run is persisted step by step, so a crash, a timeout, or a deploy resumes from the last completed step instead of from zero.
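The arithmetic can be written down as a small worked example. It counts steps only and assumes the crash lands after five of six steps have completed; stepsAfterCrash is a hypothetical helper, and real token costs are usually worse than the step count suggests, since each replayed step also resends the conversation so far.

```typescript
// Steps that still have to execute after a crash.
// completedSteps: how many steps had finished when the process died.
// durable: whether completed steps were persisted and can be skipped on retry.
function stepsAfterCrash(totalSteps: number, completedSteps: number, durable: boolean): number {
  return durable ? totalSteps - completedSteps : totalSteps
}

const restart = stepsAfterCrash(6, 5, false) // 6: every step runs again
const resume = stepsAfterCrash(6, 5, true) // 1: only the unfinished step runs

// Total work across both attempts: 11 steps with a restart, 6 with a resume.
const totalWithRestart = 5 + restart
const totalWithResume = 5 + resume
```

The five replayed steps are also where duplicate side effects come from, because each of them may already have sent a message or written a record.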

This is also why the default step limits matter more than they look. A stopWhen: stepCountIs(20) ceiling is a guardrail against a runaway loop, but it is not durability. The agent still has to finish those twenty steps inside one live process. Raise the ceiling for a genuinely long task and you have widened the window in which a single recycle wipes the whole run. The SDK gives you a budget; a framework gives you a resume point.

Both frameworks treat this as a core primitive. Eve describes itself as a "filesystem-first framework for durable AI agents," and Flue builds durability on what it calls Durable Streams, an append-only record so agents "recover from downtime and resume." You can build durability on the AI SDK, but you are now writing a checkpoint store, a resume protocol, and idempotent tool wrappers. That is a framework's job, and rebuilding it is the clearest signal you have outgrown the middle rung.
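To give a sense of what that hand-built version involves, here is a minimal sketch of the three pieces: a checkpoint store, a resume step, and an idempotency key passed to each tool. It is not how Eve or Flue implement durability. CheckpointStore, InMemoryStore, PlannedStep, and runDurably are names invented for the example, the in-memory store obviously would not survive a real crash, and a fixed plan stands in for the steps a model would choose.

```typescript
type StepRecord = { stepIndex: number; toolName: string; result: unknown }

// Hypothetical storage interface; in production this would be a database or KV store.
interface CheckpointStore {
  load(runId: string): Promise<StepRecord[]>
  append(runId: string, record: StepRecord): Promise<void>
}

class InMemoryStore implements CheckpointStore {
  private runs = new Map<string, StepRecord[]>()

  async load(runId: string): Promise<StepRecord[]> {
    return [...(this.runs.get(runId) ?? [])]
  }

  async append(runId: string, record: StepRecord): Promise<void> {
    this.runs.set(runId, [...(this.runs.get(runId) ?? []), record])
  }
}

type PlannedStep = {
  toolName: string
  // The key lets the tool (or the service behind it) detect a repeated call.
  run: (idempotencyKey: string) => Promise<unknown>
}

async function runDurably(
  store: CheckpointStore,
  runId: string,
  plan: PlannedStep[],
): Promise<unknown[]> {
  // Resume: reuse the results of every step that was already recorded.
  const completed = await store.load(runId)
  const results = completed.map((record) => record.result)

  for (let stepIndex = completed.length; stepIndex < plan.length; stepIndex++) {
    const step = plan[stepIndex]
    if (step === undefined) break

    const result = await step.run(`${runId}:${stepIndex}`)
    // Checkpoint after each step so a crash loses at most the step in flight.
    await store.append(runId, { stepIndex, toolName: step.toolName, result })
    results.push(result)
  }
  return results
}
```

A crash between running a step and recording it still replays that one step, which is why the idempotency key matters: the checkpoint bounds how much is replayed, and the key makes the replay harmless.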

Sandboxing and human approvals

The next two walls travel together because both are about trust, and both are answered the same way: by a boundary the framework owns rather than one you improvise.

Sandboxing is the wall you hit the moment your agent writes and runs code. An agent that generates a SQL query, a shell command, or a Python script is producing input you did not write, and running that input in your own process is the same mistake as evaluating a user-supplied string. It needs an isolated environment with its own filesystem and its own resource limits, so a runaway loop or a destructive command burns a sandbox instead of your server. Eve ships "sandboxed compute" and Flue exposes a sandbox primitive directly in the agent config (sandbox: local() in its quickstart). Standing up isolated compute yourself, with lifecycle and cleanup, is a project on its own.

Human-in-the-loop approvals are the wall you hit when some tool calls are too consequential to run unattended. Refunding a customer, deleting a record, merging a pull request: these should pause and wait for a person to approve before the agent proceeds. On a bare loop, you would have to suspend execution, persist the pending call, surface it to a human somewhere, and resume on their decision, which is durable execution plus a UI and a queue. Eve lists "human-in-the-loop approvals" as a built-in for exactly this. If you find yourself wiring an approval queue by hand, you are rebuilding a framework feature.
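The shape of that hand-wired approval queue looks roughly like the sketch below. It is an illustration under assumptions, not Eve's API: PendingCall, ApprovalQueue, request, and decide are invented names, and pending calls are held in memory where a real gate would persist them so they survive a restart.

```typescript
type PendingCall = {
  id: string
  toolName: string
  input: unknown
  status: 'pending' | 'approved' | 'rejected'
}

type ToolMap = Record<string, (input: unknown) => Promise<unknown>>

class ApprovalQueue {
  private calls = new Map<string, PendingCall>()
  private nextId = 1

  // The agent loop calls this instead of executing a consequential tool directly.
  request(toolName: string, input: unknown): PendingCall {
    const call: PendingCall = { id: `call-${this.nextId++}`, toolName, input, status: 'pending' }
    this.calls.set(call.id, call)
    return call
  }

  // The human-facing UI calls this; the tool runs only on approval.
  async decide(id: string, approved: boolean, tools: ToolMap): Promise<unknown> {
    const call = this.calls.get(id)
    if (!call || call.status !== 'pending') {
      throw new Error(`No pending call with id ${id}`)
    }
    call.status = approved ? 'approved' : 'rejected'
    if (!approved) return { rejected: true }

    const tool = tools[call.toolName]
    if (!tool) throw new Error(`Unknown tool: ${call.toolName}`)
    return tool(call.input)
  }
}
```

Even this toy version has to remember a call between two separate moments in time, which is why approvals and durable execution tend to arrive as the same problem.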

Multi-channel intake, long-running runs, and evals

The last two walls are about how the agent is triggered and how you know it is any good.

Multi-channel and long-running intake is the wall you hit when the agent stops being a function you call and becomes a service that reacts. A request-shaped agent answers one HTTP call, but a real operations agent is triggered by a Slack message, a GitHub event, a webhook, or a schedule, and it may run in the background for minutes. Both frameworks make channels a primitive: Eve names "Slack, Discord, Teams, Telegram, Twilio, GitHub, and Linear," and Flue describes channels that "connect your agents to external sources like Slack, GitHub, Linear, and more." You can wire one webhook to an SDK call without a framework. Wiring several, with verification and a queue and background execution, is the plumbing the framework already owns.
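A minimal sketch of the queue part of that plumbing might look like this. The Invocation shape and the IntakeQueue class are assumptions for illustration, not the event formats of Slack or GitHub, and signature verification is left out because each provider defines its own scheme.

```typescript
type Source = 'slack' | 'github' | 'webhook' | 'schedule'

// Hypothetical normalized shape: every channel is mapped to this before queuing.
type Invocation = { source: Source; deliveryId: string; prompt: string }

class IntakeQueue {
  private seen = new Set<string>()
  private queue: Invocation[] = []

  // Returns false for a redelivery, so the HTTP handler can acknowledge it without re-running.
  enqueue(invocation: Invocation): boolean {
    const key = `${invocation.source}:${invocation.deliveryId}`
    if (this.seen.has(key)) return false
    this.seen.add(key)
    this.queue.push(invocation)
    return true
  }

  // A background worker drains the queue outside the request lifecycle.
  async drain(runAgent: (invocation: Invocation) => Promise<void>): Promise<number> {
    let processed = 0
    let next = this.queue.shift()
    while (next !== undefined) {
      await runAgent(next)
      processed++
      next = this.queue.shift()
    }
    return processed
  }
}
```

The handler that receives the event only enqueues and returns, and the agent run happens later in the worker. That separation is what lets a run outlive the request that triggered it.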

Evaluation and tracing as a standing need is the quietest wall and the one that decides whether the agent stays good. Shipping an agent is not the same as knowing it works next week, after a prompt tweak or a model swap silently changes its behavior. You need to trace each run and score it against a fixed set of cases over time. Eve lists "Evals" as a built-in. This is the same component-level discipline that separates a demo from a system, and it is where teams that skip measurement get surprised. A framework with tracing and evals built in turns "it seemed fine" into a number you can watch.
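The smallest useful version of that number is a pass rate over fixed cases. The sketch below assumes a deliberately crude scoring rule, a case-insensitive substring check; EvalCase, AgentFn, and runEvals are hypothetical names, and a real harness would also record traces and use richer scoring.

```typescript
type EvalCase = { id: string; input: string; mustInclude: string }
type AgentFn = (input: string) => Promise<string>

async function runEvals(
  agent: AgentFn,
  cases: EvalCase[],
): Promise<{ passRate: number; failures: string[] }> {
  const failures: string[] = []

  for (const evalCase of cases) {
    const output = await agent(evalCase.input)
    // Crude check: the answer must mention the expected phrase.
    const passed = output.toLowerCase().includes(evalCase.mustInclude.toLowerCase())
    if (!passed) failures.push(evalCase.id)
  }

  const passRate = cases.length === 0 ? 0 : (cases.length - failures.length) / cases.length
  return { passRate, failures }
}
```

Run the same cases after every prompt tweak or model swap and compare the pass rate with the previous run. A drop tells you which case IDs to look at first.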

When the Vercel AI SDK is still the right amount of tool

Here is the part the launch posts skip. For a large share of real features, the AI SDK is not a stepping stone to a framework. It is the destination.

The AI SDK is the right amount of tool when your agent does its work inside one request and returns. A support bot that looks up an order and answers, a code reviewer that reads a diff and posts comments in the same run, an extraction agent that calls two tools and returns structured JSON: none of these need durable execution, because the run is short and a retry is cheap. None need a sandbox, because they call typed tools you wrote, not arbitrary generated code. None need approvals, channels, or a standing eval harness to ship their first version. Adding a framework here does not buy resilience you need; it buys surface area you have to learn, configure, and keep updated against a beta.

There is also a model-portability reason to stay on the SDK as long as it fits. The SDK keeps the provider behind one call shape, so swapping @ai-sdk/anthropic for another provider is a model string, not a rewrite. That is the same instinct as keeping a model-agnostic fallback layer so one provider decision cannot take your feature down. A framework adds its own opinions about models, deployment, and runtime on top of that, and those opinions are exactly what you do not want to take on before you need them.

The trap to avoid is treating the framework as the "serious" choice and the SDK as the toy. It is the reverse of the prototype-versus-production split in vibe coding versus production coding: there, the danger is shipping a prototype as if it were production. Here, the danger is adopting production-grade infrastructure for a prototype-grade problem. Both are the same mistake, which is matching the tool to the hype instead of to the problem in front of you.

What a framework gives you, and what it costs

When you do hit a wall, a framework is a genuine accelerant. The benefit is concrete: you stop writing and maintaining the persistence layer, the sandbox lifecycle, the approval queue, the channel adapters, and the eval harness, and you get them as declared primitives. For a team running several long-lived agents, that is months of plumbing you do not have to build or maintain. Both Eve and Flue are Apache-2.0, so the framework itself is not a license cost.

The cost is real, and it is not the license. Start with conceptual surface area. You trade a while loop you can read top to bottom for a lifecycle you have to learn: how sessions persist, how the sandbox is provisioned, how a run resumes, how a channel event maps to an invocation. When something misbehaves, you are now debugging the framework's model of the world, not just your own code.

Then there is maturity. As of June 2026, Eve is in public preview and Flue is at 1.0.0-beta.1, and both say their APIs may change before a stable release. Building a core workflow on a moving API means absorbing breaking changes on the framework's schedule, not yours. That is a fair trade when the plumbing it replaces is harder than the upgrade churn, and a bad trade when you adopted it for a problem you did not have.

The last cost is deployment posture, and it is where the two frameworks diverge most. Eve deploys to Vercel today, with other platforms described as on the way, so it is the lighter lift if you are already on Vercel and a constraint if you are not. Flue is built to be runtime-agnostic, deploying to Node, Cloudflare Workers, GitHub Actions, and more, which trades a turnkey path for portability. That difference is a deciding factor once you adopt, and it deserves its own comparison rather than a paragraph here.

If you adopt one: Eve or Flue?

If you have decided a framework earns its place, the next question is which one, and the short answer is that it comes down to deployment posture, not feature checklists. Both cover the five walls; they differ on where they run and how opinionated they are.

The quick cut is this. Eve is Vercel-first and turnkey if that is already your platform, using a file-based project (npx eve@latest init) with defineAgent and defineTool. Flue is runtime-agnostic and deploys across Node, Cloudflare Workers, and CI runners, with a programmatic createAgent API and skills defined as SKILL.md files. If you live on Vercel and want the fastest path, Eve fits; if you need to run on your own runtime or inside CI, Flue is built for that.

That is the headline, but the real decision turns on details that do not fit here: the exact API shapes, how each handles sandboxing and channels, the maturity gap between a public preview and a 1.0 beta, and how each one's deployment story affects your stack. I work through all of that in the companion comparison, Eve vs Flue. Read that one before you commit to either.

A decision checklist before you add the dependency

Run this before you adopt any agent framework. It is built to be answered honestly about today, not about a roadmap.

Does a run need to survive a crash, a timeout, or a deploy and resume mid-task, rather than restart from the first message? If a restart would replay tool calls or double side effects, you have the durability wall.
Does the agent write and run code, queries, or commands you did not author, such that running it in your own process is unsafe? If yes, you need a sandbox, not a try-catch.
Must any tool call pause for a human to approve it before the agent continues? An approval gate is durable execution plus a queue and a UI, and it is a framework feature.
Is the agent triggered by something other than one request, like a Slack message, a GitHub event, a webhook, or a schedule, and does it run in the background? Multi-channel and long-running intake is plumbing the SDK does not own.
Do you need to trace and score agent quality over time against fixed cases, not just ship it once? A standing eval and tracing need is the quiet wall that keeps the agent good.
How many of the five above are true today, not hypothetically? Zero means stay on the AI SDK. One is a deliberate, eyes-open adoption. Two or more means the framework is already cheaper than the plumbing.
If you adopt, can you absorb breaking changes from a preview or beta API on the framework's schedule? If a moving API is a dealbreaker, wait for a stable release or keep the loop.
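If it helps to make the counting rule explicit, the checklist can be encoded as a small function. This is only a restatement of the questions above; Walls, Recommendation, and recommend are names invented for the sketch, and the answers still have to come from an honest look at today's production needs.

```typescript
type Walls = {
  durability: boolean
  sandbox: boolean
  approvals: boolean
  channels: boolean
  evals: boolean
}

type Recommendation =
  | 'stay-on-ai-sdk'
  | 'adopt-deliberately'
  | 'adopt-framework'
  | 'wait-or-keep-loop'

function recommend(
  walls: Walls,
  canAbsorbBreakingChanges: boolean,
): { wallsHit: number; recommendation: Recommendation } {
  const wallsHit = [
    walls.durability,
    walls.sandbox,
    walls.approvals,
    walls.channels,
    walls.evals,
  ].filter((hit) => hit).length

  // Zero walls: the SDK loop is the right amount of tool.
  if (wallsHit === 0) return { wallsHit, recommendation: 'stay-on-ai-sdk' }
  // A moving preview or beta API is a dealbreaker: wait for stable or keep the loop.
  if (!canAbsorbBreakingChanges) return { wallsHit, recommendation: 'wait-or-keep-loop' }
  // One wall is a deliberate adoption; two or more makes the framework the cheaper path.
  return { wallsHit, recommendation: wallsHit === 1 ? 'adopt-deliberately' : 'adopt-framework' }
}
```

For example, an agent that only needs approvals today, on a team that can absorb API churn, comes out as a deliberate adoption rather than an automatic one.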

The same instinct applies when you bring an agent into an existing system: understand what is already there before you bolt on infrastructure, the way you would when onboarding to a new codebase with AI tools. This list is deliberately conservative because the failure mode in mid-2026 is adopting too early, not too late.

Conclusion

Eve's launch and Flue's 1.0 beta landing in the same week make "adopt a framework" feel like the obvious next step, and for most teams already shipping on the Vercel AI SDK, it is not. A framework is not a maturity badge. It is a specific answer to five specific problems: durable execution, sandboxing, approvals, multi-channel intake, and evals. If you have one of those problems, adopt deliberately and choose Eve or Flue by where you need to deploy. If you have none of them, the SDK's tool loop is not a stepping stone you will outgrow next quarter; it is the right amount of tool, and the most senior move is to keep it until a real wall forces your hand.

Written by Thomas Findlay.