Multi-Agent Systems: When One Agent Isn't Enough

A team showed me their architecture last month and it had nine agents in it. A planner, a researcher, a critic, a writer, a fact-checker, a router, two specialists, and one whose job I never fully understood. The task was: read a document and pull out the obligations. One model with one good prompt does that. They had built a small bureaucracy to summarize a PDF, and like every bureaucracy, most of its energy went into the agents talking to each other.

This is the dominant failure I see right now. Multi-agent is the architecture people reach for because it looks like seniority. More boxes, more arrows, more of the diagram that gets you nodded at in a review. The honest default is the other direction. Fewer agents. Usually one. Sometimes none.

What an “agent” actually buys you

An agent is a loop: the model picks a tool, sees the result, decides what to do next, repeats until it thinks it is done. That loop is the expensive part. Every hop is another model call, another chance to misread the last step, another stretch of latency, another slice of the token bill. One agent is already a stochastic thing wrapped around a control flow you do not fully own.
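To make the cost of that loop concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `fake_model` stands in for a real model call, `TOOLS` holds one pretend tool, and `max_hops` is a budget I added for illustration. The only point is that each pass through the loop is one more model call.

```python
from typing import Callable

# Hypothetical stand-in for a model call. It returns (tool_name, argument),
# or ("done", answer) once it believes the goal is met.
def fake_model(history: list[str]) -> tuple[str, str]:
    if not any(entry.startswith("search:") for entry in history):
        return ("search", "obligations in contract")
    return ("done", "2 obligations found")

# Hypothetical tool registry with a single pretend tool.
TOOLS: dict[str, Callable[[str], str]] = {
    "search": lambda query: f"search: results for {query!r}",
}

def run_agent(model, tools, goal: str, max_hops: int = 5) -> tuple[str, int]:
    """Run the pick-tool, see-result, decide-again loop; return (answer, hops used)."""
    history = [f"goal: {goal}"]
    for hop in range(1, max_hops + 1):
        action, arg = model(history)        # one model call per hop
        if action == "done":
            return arg, hop
        history.append(tools[action](arg))  # the model sees this on the next hop
    raise RuntimeError("hop budget exhausted before the model finished")

answer, hops = run_agent(fake_model, TOOLS, "pull out the obligations")
```

Even this toy run pays for two model calls to do one search. A real loop pays the same per-hop price in latency and tokens.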

Now run several of them and have them hand work to each other. You have not added intelligence. You have added a distributed system, and a flaky one, where the messages between nodes are written in English by a model that hallucinates. Every handoff is a place a fact gets dropped, a constraint gets softened, or a task quietly mutates into a slightly different task. The supervisor asks for the top three risks; the specialist returns five, two of them invented; the critic blesses them because they read well. Nobody lied. The information just decayed at every border.

Here is the part nobody tells you. Most multi-agent systems are not solving a multi-agent problem. They are one agent’s job that somebody chopped into stages because chopping felt like engineering. And the moment you chop, you sign up for the orchestration tax: serialization between steps, a protocol for who talks to whom, retries when a sub-agent returns garbage, and the debugging session from hell when the output is wrong and you have five transcripts to read instead of one.

The cases where it genuinely pays

I am not against multiple agents. I have shipped them. But the bar is specific, and there are only three reasons I have ever found that hold up.

The first is real parallelism. Not “these steps could run side by side in principle,” but “these subtasks do not depend on each other and the wall-clock time matters.” Fan out a research question across forty sources at once, let each branch read independently, fold the results back. That is a genuine win, because the work is actually independent and you are buying speed you cannot get from one sequential loop. The test is brutal and simple: if subtask B needs anything subtask A produced, they are not parallel, and you are kidding yourself.
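A fan-out like that might look roughly like the sketch below, using Python's `asyncio.gather`. The `read_source` coroutine is a hypothetical stand-in for whatever one branch actually does (a model call plus its tools), and the source names are made up.

```python
import asyncio

async def read_source(source: str) -> str:
    # Hypothetical stand-in for one independent branch of work.
    await asyncio.sleep(0.01)
    return f"notes from {source}"

async def fan_out(sources: list[str]) -> list[str]:
    # No branch reads another branch's output, so they can all run at once.
    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(read_source(s) for s in sources))

notes = asyncio.run(fan_out([f"source-{i}" for i in range(40)]))
```

Note what the code cannot express: there is no way for one branch to wait on another. If you find yourself wanting that, the work was sequential all along.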

The second is separation of privilege. This is the one I care most about, because it is a security argument, not a performance one. The agent that reads from the production database should not be the same agent that can post to a customer. The agent that drafts a refund should not be the agent that approves it. Splitting here is not about making the model smarter. It is about making the blast radius small. A compromised or confused agent should be able to touch exactly one bounded set of tools, and no more. I will split for this even when a single agent would technically work, because the failure I am designing against is not a wrong answer, it is a wrong action on money or data.
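One way to sketch that boundary is an explicit tool allowlist per agent, enforced in code rather than in the prompt. The `Agent` class, the tool names, and the refund fields below are all assumptions made up for illustration, not any framework's API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Agent:
    name: str
    allowed_tools: frozenset[str]

    def call(self, tool: str, registry: dict, *args):
        # The check lives in code, so a confused or compromised model
        # cannot talk its way past it.
        if tool not in self.allowed_tools:
            raise PermissionError(f"{self.name} may not call {tool}")
        return registry[tool](*args)

# Hypothetical tools: one drafts a refund, the other approves it.
REGISTRY = {
    "draft_refund": lambda order_id, amount: {
        "order_id": order_id, "amount": amount, "status": "draft"},
    "approve_refund": lambda draft: {**draft, "status": "approved"},
}

drafter = Agent("drafter", frozenset({"draft_refund"}))
approver = Agent("approver", frozenset({"approve_refund"}))

draft = drafter.call("draft_refund", REGISTRY, "order-17", 25.0)
approved = approver.call("approve_refund", REGISTRY, draft)

# The drafter trying to approve its own refund is refused.
try:
    drafter.call("approve_refund", REGISTRY, draft)
    blocked = False
except PermissionError:
    blocked = True
```

In a real system the boundary would also need separate credentials behind each tool. An allowlist in the same process is only the first layer.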

The third is genuinely distinct context. When two subtasks need different tools, different system prompts, and different reference material, and stuffing all of it into one context window makes the model worse at both, a clean split helps. A coding sub-agent with its repo tools and a compliance sub-agent with its policy corpus do not want to share a brain. Past a point, more context in one window is not richer, it is noisier, and the model starts confusing one job’s instructions for the other’s.

Notice what is not on that list. “It feels more organized.” “Each agent has a clear role.” “It mirrors how a human team works.” Those are aesthetics. An org chart is a coordination cost humans pay because one human cannot hold everything. A model can hold a lot in one prompt. Do not pay a coordination cost to solve a problem you do not have.

Supervisor and pipeline, and the complexity to refuse

When you do split, two shapes cover almost everything.

A pipeline is a fixed sequence: stage one’s output is stage two’s input, the path never branches, and you control the routing in plain code, not by asking a model where to go next. A supervisor is one orchestrator that holds the goal and delegates to specialists, deciding at runtime who gets the next piece. Pipelines are predictable and cheap to debug. Supervisors are flexible and cost more, because the supervisor itself is a model call in the loop, reasoning about delegation every turn.
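The difference between the two shapes is easiest to see in who decides the next step. Below is a rough sketch with hypothetical stand-ins: `research` and `draft` play the model-backed workers, and `fake_supervisor` plays the orchestrating model. It counts the routing calls to make the supervisor's extra cost visible.

```python
from typing import Callable

Stage = Callable[[str], str]

# Hypothetical stand-ins for model-backed workers.
def research(task: str) -> str:
    return task + " -> researched"

def draft(task: str) -> str:
    return task + " -> drafted"

def run_pipeline(task: str, stages: list[Stage]) -> str:
    # Routing lives in plain code: the order is fixed before the run starts.
    for stage in stages:
        task = stage(task)
    return task

def run_supervisor(task: str, specialists: dict[str, Stage],
                   choose_next: Callable[[str], str],
                   max_turns: int = 5) -> tuple[str, int]:
    # Routing is itself a model call: choose_next stands in for the supervisor.
    routing_calls = 0
    for _ in range(max_turns):
        name = choose_next(task)
        routing_calls += 1
        if name == "finish":
            return task, routing_calls
        task = specialists[name](task)
    raise RuntimeError("turn budget exhausted")

def fake_supervisor(task: str) -> str:
    # Hypothetical routing policy; a real supervisor would be a model call.
    if "researched" not in task:
        return "research"
    if "drafted" not in task:
        return "draft"
    return "finish"

pipeline_result = run_pipeline("memo", [research, draft])
supervisor_result, routing_calls = run_supervisor(
    "memo", {"research": research, "draft": draft}, fake_supervisor)
```

Both produce the same output here. The supervisor needed three extra routing decisions to get there, and in production each of those is a model call that can choose wrong.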

[Figure: a supervisor delegating to specialized agents, beside the single-agent alternative that does the same job with tools. Left: the supervisor pattern, with one model spending tokens to route and three sub-agents to coordinate. Right: the same capability as one agent calling the same tools directly, no inter-agent messages to decay.]

Look at the two sides of that diagram. The right-hand version has the same tools. It can read, search, and draft, exactly like the left-hand fleet. What it does not have is three handoffs where meaning leaks, a supervisor burning tokens to decide who speaks, and four transcripts to reconcile when it goes wrong. For most of what people build, the right side is not a downgrade. It is the version that works on a Tuesday when something breaks and you need to find out why before the on-call engineer's patience runs out.

So here is the complexity I refuse. I refuse agents that exist only to critique another agent, when an eval or a validation check in code does the same job deterministically and you can actually trust the result. I refuse a router-agent deciding control flow that a match statement would handle, because a model choosing a branch is a non-deterministic bug generator wearing a smart hat. I refuse “agent debates agent” setups, which are mostly a way to spend tokens performing rigor. And I refuse any agent boundary that exists because the diagram looked thin without it.

The plain-code option deserves its own line, because it is the one people skip. A surprising share of “agentic” work is a fixed sequence of three known steps with one model call in the middle. That is not an agent. That is a function with an LLM in it, and you should write it as a function: testable, traceable, cheap, and it does the same thing every time you run it. Reach for an agent loop only when the path genuinely cannot be known ahead of time. Reach for several only when one of those three reasons is true and you can name which one out loud.
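As a rough illustration of "a function with an LLM in it", here is a sketch with three fixed steps and one model call in the middle. The `llm` parameter is a hypothetical callable that takes a prompt and returns text. Passing it in means the function can be tested with a fake, as below, without touching a real model.

```python
import json
from typing import Callable

def extract_obligations(document: str, llm: Callable[[str], str]) -> list[str]:
    # Step 1 (plain code): build the prompt.
    prompt = ("List the obligations in this document as a JSON array of strings.\n\n"
              + document.strip())
    # Step 2: the single model call.
    raw = llm(prompt)
    # Step 3 (plain code): parse and validate, failing loudly on bad output.
    items = json.loads(raw)
    if not isinstance(items, list) or not all(isinstance(i, str) for i in items):
        raise ValueError("model did not return a JSON array of strings")
    return items

def fake_llm(prompt: str) -> str:
    # Hypothetical canned response standing in for a real model.
    return json.dumps(["Tenant shall pay rent monthly."])

result = extract_obligations("The tenant shall pay rent monthly.", fake_llm)
```

There is no loop, no tool selection, and nothing for the model to decide about control flow, so the path through the code is the same on every run.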

The question I make every team answer before they get a second agent: what does this agent do that a tool call inside the first one could not? If the answer is real parallelism, isolated privilege, or a context that genuinely needs its own brain, build it. If the answer is some version of “it’s cleaner,” you do not have a multi-agent problem. You have one agent and a diagram you got attached to.