Multi-agent systems in production
Most "multi-agent" deployments in 2026 are single-agent systems wrapped in vendor marketing. The actual question for production is when the multi-agent complexity earns its keep — and how to engineer the seams between agents so they don't silently rot. This article: five patterns that work, five anti-patterns that don't, and the discipline that separates a working system from a research demo.
When to go multi-agent
Three honest signals that justify the multi-agent step up.
1. Sub-tasks are fundamentally different. A research agent and a writing agent benefit from different tools, different prompts, and different model tiers. Cramming both into one prompt produces worse output than splitting them.
2. Parallelization unlocks throughput. Diligence over 50 documents, comparison shopping across 20 vendors, multi-source research synthesis: these benefit from parallel workers. Wall-clock time shrinks roughly in inverse proportion to concurrency, until rate limits bite.
3. Quality benefits from review/critique. One agent generates; another reviews. This demonstrably improves hard reasoning tasks, but it's worth the cost only when output quality is highly leveraged (e.g. legal drafting, code generation).
If you can't articulate which of those three applies, you don't need multi-agent. Promote later when you actually do.
Five patterns that work
Single-loop ReAct
80% of agent workloads
One agent. One control loop. All tools available from step one. The model picks a tool, runs it, observes the result, decides what's next. Cheap, observable, easy to debug. Most production agents AISD ships use this — even when vendors call them 'multi-agent'.
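A minimal sketch of the loop, assuming a hypothetical `call_model` client and a stub tool registry (neither is a specific framework's API):

```python
# Single-loop ReAct sketch. `call_model` is a stand-in for your LLM client;
# it is assumed to return {"tool": name, "args": {...}}.
import json

TOOLS = {
    "search": lambda q: f"results for {q!r}",  # stub tool
    "finish": None,                            # sentinel: return the answer
}

def call_model(messages: list[dict]) -> dict:
    raise NotImplementedError  # your model client goes here

def react_loop(task: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):                 # hard step cap, always
        action = call_model(messages)
        if action["tool"] == "finish":
            return action["args"]["answer"]
        result = TOOLS[action["tool"]](**action["args"])
        messages.append({"role": "tool", "content": json.dumps(result)})
    raise RuntimeError("step budget exhausted")
```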
Plan-and-execute
Complex tasks where premature tool calls waste cost
Agent first writes a plan (a list of steps), then executes each step. Better than ReAct for jobs that benefit from a coherent strategy upfront — e.g., diligence work, research synthesis, multi-file edits. Same single-agent backbone; just two passes.
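A sketch of the two passes, assuming a hypothetical text-in/text-out `complete` call and a one-step-per-line plan format:

```python
# Plan-and-execute sketch: one planning pass, then one execution pass per
# step. The plan format (one step per line) is an assumption.
def complete(prompt: str) -> str:
    raise NotImplementedError  # your LLM client goes here

def plan_and_execute(task: str, max_steps: int = 12) -> str:
    plan = complete(f"Write a short numbered plan for: {task}")
    steps = [s for s in plan.splitlines() if s.strip()][:max_steps]
    context: list[str] = []
    for step in steps:
        context.append(
            complete(f"Task: {task}\nDone so far: {context}\nExecute: {step}")
        )
    return context[-1] if context else plan
```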
Routed sub-agents
Multiple distinct workloads with shared context
A router agent dispatches to specialized sub-agents (researcher, drafter, reviewer). Each sub-agent has its own toolset and prompts. Useful when sub-tasks are fundamentally different and one prompt can't cover all of them well.
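The dispatch itself should be plain code. A sketch, where the agent classes and the classifier call are hypothetical:

```python
# Routed sub-agents sketch: a cheap classifier labels the workload,
# deterministic code dispatches to the matching specialized agent.
from typing import Protocol

class Agent(Protocol):
    def run(self, task: str) -> str: ...

AGENTS: dict[str, Agent] = {}  # e.g. {"research": ..., "draft": ..., "review": ...}

def classify(task: str) -> str:
    raise NotImplementedError  # stand-in for a cheap classifier model call

def route(task: str) -> str:
    label = classify(task)
    if label not in AGENTS:
        raise ValueError(f"no agent registered for {label!r}")
    return AGENTS[label].run(task)
```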
Hierarchical (manager + workers)
Long-running tasks with parallel sub-tasks
A manager agent decomposes a task and assigns sub-tasks to worker agents in parallel. Workers report back; the manager synthesizes. Used for diligence-style sweeps over many documents, or research aggregation. Cost and complexity grow fast.
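A sketch of the fan-out, assuming hypothetical `decompose`, `worker`, and `synthesize` agent calls; the concurrency bound is the knob that keeps cost in check:

```python
# Hierarchical sketch: manager decomposes, workers run in parallel under a
# semaphore, manager synthesizes the results.
import asyncio

async def decompose(task: str) -> list[str]:
    raise NotImplementedError  # manager pass: split task into subtasks

async def worker(subtask: str) -> str:
    raise NotImplementedError  # one worker run, with its own step/cost caps

async def synthesize(task: str, results: list[str]) -> str:
    raise NotImplementedError  # manager pass: merge worker outputs

async def run_hierarchical(task: str, max_workers: int = 8) -> str:
    subtasks = await decompose(task)
    sem = asyncio.Semaphore(max_workers)      # bound concurrency and spend
    async def bounded(st: str) -> str:
        async with sem:
            return await worker(st)
    results = await asyncio.gather(*(bounded(st) for st in subtasks))
    return await synthesize(task, results)
```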
Debate / consensus
High-stakes decisions where one model's output isn't enough
Two or more agents argue, propose, critique. Final answer comes from a judge or majority vote. Improves quality on hard reasoning tasks at 2–4× cost. Rare in production; common in research benchmarks.
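A sketch of one propose-critique-judge round; the proposer and judge callables are hypothetical, and the cost multiplier is visible in the loop structure:

```python
# Debate/consensus sketch: n proposers, one critique round, a judge decides.
# Token cost grows with every proposer and every round.
from typing import Callable

Proposer = Callable[[str], str]

def debate(task: str, proposers: list[Proposer],
           judge: Callable[[str, list[str]], str]) -> str:
    proposals = [p(task) for p in proposers]          # parallelizable
    critiques = [p(f"Critique each answer to {task!r}:\n{proposals}")
                 for p in proposers]
    return judge(task, proposals + critiques)         # judge or majority vote
```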
Five anti-patterns
Multi-agent for the sake of multi-agent
Adding a second agent because the framework supports it, not because the workflow needs it. Every additional agent adds tokens, latency, and debugging surface. Default to one agent; promote to multi only when you can articulate why.
Free-form agent-to-agent chat
Agents chatting at each other in unstructured natural language. Token cost balloons; observability collapses. Always define typed messages between agents — same discipline as inter-service contracts.
Shared memory free-for-all
Every agent reads and writes the same scratch pad. Race conditions, prompt-injection attack surface, and impossible-to-debug failures. Treat shared state as a data contract; every read/write goes through a typed API.
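A sketch of what the contract looks like, assuming Pydantic v2; the `Finding` schema is illustrative:

```python
# Shared state behind a typed, locked API instead of a free-form scratch pad.
import threading
from pydantic import BaseModel

class Finding(BaseModel):
    agent: str
    doc_id: str
    summary: str

class SharedState:
    """Every read/write goes through this contract; no raw dict access."""
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._findings: list[Finding] = []

    def write(self, raw: dict) -> None:
        finding = Finding.model_validate(raw)  # reject malformed writes
        with self._lock:                       # no racy concurrent appends
            self._findings.append(finding)

    def read(self, doc_id: str) -> list[Finding]:
        with self._lock:
            return [f for f in self._findings if f.doc_id == doc_id]
```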
No termination conditions
Agents that loop indefinitely waiting for the goal to be 'done.' Always set max-step caps, max-cost caps, and max-wallclock budgets at every level — manager and worker.
No eval harness for the orchestration layer
Per-agent evals exist; system-level evals don't. Multi-agent systems fail at the seams between agents — the evals you need are end-to-end, not per-component.
Production discipline
The seams between agents are where multi-agent systems silently rot. These are the disciplines AISD applies on every multi-agent engagement.
Typed messages between agents
Use a schema (Pydantic, Zod, JSON schema) for inter-agent communication. Validate on both ends. This is the single highest-leverage discipline in multi-agent systems.
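A minimal sketch, assuming Pydantic v2; the field names are illustrative, the point is validation on send and re-validation on receive:

```python
# Typed inter-agent message: validated by the sender, re-validated by the
# receiver, exactly like an inter-service contract.
from pydantic import BaseModel, ValidationError

class ReviewRequest(BaseModel):
    trace_id: str
    draft: str
    rubric: list[str]

def send(raw: dict) -> str:
    msg = ReviewRequest.model_validate(raw)    # sender validates
    return msg.model_dump_json()

def receive(payload: str) -> ReviewRequest:
    try:
        return ReviewRequest.model_validate_json(payload)  # receiver re-validates
    except ValidationError as e:
        raise RuntimeError(f"malformed inter-agent message: {e}") from e
```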
Cost + step caps at every level
Manager has a budget. Workers have budgets. Tool calls have budgets. Cap them all; alert before they're hit.
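One shape this can take; the thresholds and the alert hook are illustrative:

```python
# Budget tracked per level (manager, worker, tool). Alert at 80%, hard-stop
# at the cap, so the pager fires before the bill does.
from dataclasses import dataclass, field
import time

@dataclass
class Budget:
    max_steps: int
    max_cost_usd: float
    max_seconds: float
    steps: int = 0
    cost_usd: float = 0.0
    started: float = field(default_factory=time.monotonic)

    def charge(self, cost_usd: float) -> None:
        self.steps += 1
        self.cost_usd += cost_usd
        elapsed = time.monotonic() - self.started
        if (self.steps / self.max_steps > 0.8
                or self.cost_usd / self.max_cost_usd > 0.8):
            alert("budget at 80%")             # page before the cap, not at it
        if (self.steps > self.max_steps
                or self.cost_usd > self.max_cost_usd
                or elapsed > self.max_seconds):
            raise RuntimeError("budget exhausted")

def alert(msg: str) -> None:
    print(f"ALERT: {msg}")                     # stand-in for your pager hook
```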
Per-agent observability
Every agent's inputs, outputs, tool calls, and decisions logged with a consistent trace ID. LangSmith, Langfuse, or homegrown OpenTelemetry — pick one and instrument everywhere.
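A stdlib sketch of the shape; swap in LangSmith, Langfuse, or OpenTelemetry spans, but keep the single trace ID minted at the entry point:

```python
# Structured log lines with a propagated trace ID: every agent, every step,
# one correlatable stream.
import json, logging, uuid

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agents")

def log_step(trace_id: str, agent: str, event: str, payload: dict) -> None:
    log.info(json.dumps({"trace_id": trace_id, "agent": agent,
                         "event": event, **payload}))

trace_id = str(uuid.uuid4())           # minted once, at the entry point
log_step(trace_id, "manager", "dispatch", {"subtasks": 12})
log_step(trace_id, "worker-3", "tool_call", {"tool": "search", "args": "..."})
```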
Deterministic orchestration logic outside the agent
Don't let an agent decide whether to spawn a sub-agent. Wrap orchestration in deterministic code; agents only handle the per-step reasoning. This makes failure modes tractable.
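A sketch of the split; the fan-out threshold and helper names are illustrative:

```python
# Orchestration decisions live in deterministic code; agents only do the
# per-step reasoning inside run_worker_agent / run_synthesis_agent.
def orchestrate(docs: list[str]) -> str:
    if len(docs) > 10:                                    # code, not the model,
        summaries = [run_worker_agent(d) for d in docs]   # decides to fan out
    else:
        summaries = [run_worker_agent("\n\n".join(docs))]
    return run_synthesis_agent(summaries)

def run_worker_agent(text: str) -> str:
    raise NotImplementedError  # single-agent loop with its own budget

def run_synthesis_agent(parts: list[str]) -> str:
    raise NotImplementedError
```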
Eval harness against end-to-end outcomes
A golden test set of 50–500 representative inputs, scored against the user-visible outcome. Run on every PR. Multi-agent systems regress at the seams — only end-to-end evals catch this.
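A sketch of the harness; the golden-set format (JSONL with `input`/`expected`) and the scorer are assumptions, not a specific eval framework:

```python
# End-to-end eval: run the whole pipeline on golden inputs, score the
# user-visible outcome, fail the PR on regression.
import json
from pathlib import Path

def run_system(inp: str) -> str:
    raise NotImplementedError  # the full multi-agent pipeline, end to end

def score(output: str, expected: str) -> float:
    raise NotImplementedError  # exact match, rubric grader, or LLM judge

def evaluate(golden_path: str, threshold: float = 0.9) -> None:
    lines = Path(golden_path).read_text().splitlines()
    cases = [json.loads(line) for line in lines if line.strip()]
    scores = [score(run_system(c["input"]), c["expected"]) for c in cases]
    mean = sum(scores) / len(scores)
    assert mean >= threshold, f"end-to-end eval regressed: {mean:.2f}"
```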
AISD's default
For new production engagements, AISD's default is single-loop ReAct with carefully designed tools. Most "multi-agent" workloads are better served by a single agent calling well-typed tools than by multiple agents passing free-form messages.
We promote to plan-and-execute when the workload benefits from upfront strategy. We promote to routed sub-agents when sub-tasks are genuinely different. We promote to hierarchical only when parallelization is the metric mover. Debate-style multi-agent: rarely, on hard reasoning where quality matters more than cost.
The discipline isn't picking the fanciest pattern — it's picking the simplest one that solves the problem. Vendors optimize for differentiation; production optimizes for the pager not going off at 3am.
Adjacent reading: LLM orchestration patterns and agentic AI design patterns.