
    Multi-agent systems in production

    Most "multi-agent" deployments in 2026 are single-agent systems wrapped in vendor marketing. The actual question for production is when the multi-agent complexity earns its keep — and how to engineer the seams between agents so they don't silently rot. This article: five patterns that work, five anti-patterns that don't, and the discipline that separates a working system from a research demo.

    Updated · 2026-05-03 · 11 min read

    When to go multi-agent

    Three honest signals that justify the multi-agent step up.

    1. Sub-tasks are fundamentally different. A research agent and a writing agent benefit from different tools, different prompts, and different model tiers. Cramming both into one prompt produces worse output than splitting them.
    2. Parallelization unlocks throughput. Diligence over 50 documents, comparison shopping across 20 vendors, multi-source research synthesis — these benefit from parallel workers. Wall-clock latency drops roughly in proportion to the concurrency you can run, until coordination overhead catches up.
    3. Quality benefits from review/critique. One agent generates; another reviews. This demonstrably improves hard reasoning tasks. Worth the cost only when output quality is highly leveraged (e.g. legal drafting, code generation).

    If you can't articulate which of those three applies, you don't need multi-agent. Promote later when you actually do.

    Five patterns that work

    Single-loop ReAct

    80% of agent workloads

    One agent. One control loop. All tools available from step one. The model picks a tool, runs it, observes the result, decides what's next. Cheap, observable, easy to debug. Most production agents AISD ships use this — even when vendors call them 'multi-agent'.
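    The loop itself is small. A minimal sketch, with a stubbed `fake_model` standing in for the LLM's tool-choice call and an invented two-tool registry — a real system would call a model API and a real toolset:

    ```python
    from typing import Callable

    # Hypothetical tool registry; names and behavior are illustrative only.
    TOOLS: dict[str, Callable[[str], str]] = {
        "search": lambda q: f"results for {q!r}",
        "finish": lambda a: a,
    }

    def fake_model(observations: list[str]) -> tuple[str, str]:
        """Stand-in for the LLM's next-action choice: search once, then finish."""
        if not observations:
            return "search", "agent frameworks"
        return "finish", observations[-1]

    def react_loop(task: str, max_steps: int = 5) -> str:
        observations: list[str] = []
        for _ in range(max_steps):        # hard step cap -- see anti-patterns below
            tool, arg = fake_model(observations)
            result = TOOLS[tool](arg)
            if tool == "finish":
                return result
            observations.append(result)   # observe the result, then decide again
        return "step budget exhausted"
    ```

    Everything else — planning, routing, hierarchy — is a variation on this loop.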

    Plan-and-execute

    Complex tasks where premature tool calls waste cost

    Agent first writes a plan (a list of steps), then executes each step. Better than ReAct for jobs that benefit from a coherent strategy upfront — e.g., diligence work, research synthesis, multi-file edits. Same single-agent backbone; just two passes.
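    The two-pass structure can be sketched like this, with `fake_planner` and `fake_executor` as stand-ins for the two LLM calls (both names and outputs are invented for illustration):

    ```python
    def fake_planner(task: str) -> list[str]:
        # Stand-in for the planning pass: one LLM call that returns an ordered step list.
        return [f"research {task}", f"draft {task}", f"review {task}"]

    def fake_executor(step: str, context: list[str]) -> str:
        # Stand-in for the execution pass: one call per step, seeing prior results.
        return f"done: {step}"

    def plan_and_execute(task: str) -> list[str]:
        plan = fake_planner(task)         # pass 1: commit to a strategy upfront
        results: list[str] = []
        for step in plan:                 # pass 2: execute each step in order
            results.append(fake_executor(step, results))
        return results
    ```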

    Routed sub-agents

    Multiple distinct workloads with shared context

    A router agent dispatches to specialized sub-agents (researcher, drafter, reviewer). Each sub-agent has its own toolset and prompts. Useful when sub-tasks are fundamentally different and one prompt can't cover all of them well.
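    A sketch of the dispatch shape, assuming three sub-agents and a keyword-matching `fake_router` standing in for the LLM routing call:

    ```python
    from typing import Callable

    # Each sub-agent is its own callable with its own (imagined) prompt and toolset.
    SUB_AGENTS: dict[str, Callable[[str], str]] = {
        "research": lambda t: f"[researcher] {t}",
        "draft": lambda t: f"[drafter] {t}",
        "review": lambda t: f"[reviewer] {t}",
    }

    def fake_router(task: str) -> str:
        # Stand-in for the router agent's LLM decision; keyword match for the sketch.
        for name in SUB_AGENTS:
            if name in task.lower():
                return name
        return "research"                 # explicit fallback, never an open-ended guess

    def route(task: str) -> str:
        return SUB_AGENTS[fake_router(task)](task)
    ```

    The point of the pattern is that the router's output is a closed set of names, so a bad routing decision is a visible, testable event rather than free-form drift.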

    Hierarchical (manager + workers)

    Long-running tasks with parallel sub-tasks

    A manager agent decomposes a task and assigns sub-tasks to worker agents in parallel. Workers report back; the manager synthesizes. Used for diligence-style sweeps over many documents, or research aggregation. Cost and complexity grow fast.
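    The fan-out/fan-in shape, with `fake_worker` standing in for a worker agent and a thread pool standing in for whatever concurrency primitive the stack provides:

    ```python
    from concurrent.futures import ThreadPoolExecutor

    def fake_worker(doc: str) -> str:
        # Stand-in for one worker agent summarizing one document.
        return f"summary of {doc}"

    def manager(docs: list[str], max_workers: int = 4) -> str:
        # Manager decomposes the sweep, fans out to workers, then synthesizes.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            summaries = list(pool.map(fake_worker, docs))  # order preserved
        return " | ".join(summaries)      # synthesis step (another LLM call in practice)
    ```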

    Debate / consensus

    High-stakes decisions where one model's output isn't enough

    Two or more agents argue, propose, critique. Final answer comes from a judge or majority vote. Improves quality on hard reasoning tasks at 2–4× cost. Rare in production; common in research benchmarks.
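    The majority-vote variant reduces to a few lines once each proposer is a callable; the stub proposers below replace what would be independent model calls:

    ```python
    from collections import Counter
    from typing import Callable

    def majority_vote(proposers: list[Callable[[str], str]], question: str) -> str:
        # Each agent proposes an answer; the most common answer wins.
        votes = Counter(p(question) for p in proposers)
        return votes.most_common(1)[0][0]
    ```

    Note the cost structure is visible right in the signature: N proposers means N full model calls per question, before any critique rounds.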

    Five anti-patterns

    Multi-agent for the sake of multi-agent

    Adding a second agent because the framework supports it, not because the workflow needs it. Every additional agent adds tokens, latency, and debugging surface. Default to one agent; promote to multi only when you can articulate why.

    Free-form agent-to-agent chat

    Agents chatting at each other in unstructured natural language. Token cost balloons; observability collapses. Always define typed messages between agents — same discipline as inter-service contracts.

    Shared memory free-for-all

    Every agent reads and writes the same scratch pad. Race conditions, prompt-injection attack surface, and impossible-to-debug failures. Treat shared state as a data contract; every read/write goes through a typed API.
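    One way to sketch that contract — a `Blackboard` class (name invented) that namespaces keys per agent, serializes access, and keeps an audit trail instead of exposing a raw dict:

    ```python
    import threading

    class Blackboard:
        """Shared state behind a typed API: namespaced keys, locking, audit log."""

        def __init__(self) -> None:
            self._data: dict[str, str] = {}
            self._lock = threading.Lock()
            self.audit: list[tuple[str, str]] = []

        def write(self, agent: str, key: str, value: str) -> None:
            with self._lock:                       # no racing raw writes
                self._data[f"{agent}:{key}"] = value
                self.audit.append((agent, key))    # every mutation is attributable

        def read(self, agent: str, key: str) -> str:
            with self._lock:
                return self._data[f"{agent}:{key}"]
    ```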

    No termination conditions

    Agents that loop indefinitely waiting for the goal to be 'done.' Always set max-step caps, max-cost caps, and max-wallclock budgets at every level — manager and worker.

    No eval harness for the orchestration layer

    Per-agent evals exist; system-level evals don't. Multi-agent systems fail at the seams between agents — the evals you need are end-to-end, not per-component.

    Production discipline

    The seams between agents are where multi-agent systems silently rot. These are the disciplines AISD applies on every multi-agent engagement.

    Typed messages between agents

    Use a schema (Pydantic, Zod, JSON schema) for inter-agent communication. Validate on both ends. This is the single highest-leverage discipline in multi-agent systems.
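    A minimal stdlib sketch of a typed inter-agent message — a production system would use Pydantic or JSON Schema as above; the field names here are invented:

    ```python
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ResearchRequest:
        """One message type the router may send to the researcher."""
        query: str
        max_sources: int

        def __post_init__(self) -> None:
            # Validate on construction, so malformed messages fail at the seam.
            if not self.query.strip():
                raise ValueError("query must be non-empty")
            if not 1 <= self.max_sources <= 50:
                raise ValueError("max_sources out of range")

    def handle(msg: ResearchRequest) -> str:
        # Receiving agent accepts only the typed message, never raw prose.
        return f"researching {msg.query!r} across {msg.max_sources} sources"
    ```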

    Cost + step caps at every level

    Manager has a budget. Workers have budgets. Tool calls have budgets. Cap them all; alert before they're hit.
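    One shape this can take — a `Budget` object (thresholds invented) that every agent charges against, flagging at 80% of the cost cap before the hard limit trips:

    ```python
    class Budget:
        """Cap steps and spend; flag for alerting before the hard limit."""

        def __init__(self, max_steps: int, max_cost: float, alert_at: float = 0.8):
            self.max_steps, self.max_cost = max_steps, max_cost
            self.alert_at = alert_at
            self.steps, self.cost = 0, 0.0
            self.alerted = False

        def charge(self, cost: float) -> None:
            self.steps += 1
            self.cost += cost
            if not self.alerted and self.cost >= self.alert_at * self.max_cost:
                self.alerted = True            # hook: page/alert here, before failure
            if self.steps > self.max_steps or self.cost > self.max_cost:
                raise RuntimeError("budget exceeded")
    ```

    Manager and workers each hold their own instance, so a runaway worker trips its own cap without taking the whole run down.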

    Per-agent observability

    Every agent's inputs, outputs, tool calls, and decisions logged with a consistent trace ID. LangSmith, Langfuse, or homegrown OpenTelemetry — pick one and instrument everywhere.
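    The minimum viable version is one structured log line per agent decision, all carrying the same trace ID — a sketch using stdlib `json` and `uuid` in place of a tracing library:

    ```python
    import json
    import uuid

    def make_trace_id() -> str:
        # One trace ID per end-to-end request, shared by every agent it touches.
        return uuid.uuid4().hex

    def log_event(trace_id: str, agent: str, event: str, payload: dict) -> str:
        # One structured line per decision; greppable and joinable on trace_id.
        record = {"trace_id": trace_id, "agent": agent, "event": event, **payload}
        line = json.dumps(record, sort_keys=True)
        print(line)
        return line
    ```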

    Deterministic orchestration logic outside the agent

    Don't let an agent decide whether to spawn a sub-agent. Wrap orchestration in deterministic code; agents only handle the per-step reasoning. This makes failure modes tractable.
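    Concretely: the branch that decides which agents run is plain code, testable without a model in the loop. A sketch with invented stage names:

    ```python
    def orchestrate(task: str, needs_review: bool) -> list[str]:
        # Deterministic code decides which agents run; agents never spawn agents.
        pipeline = ["research", "draft"]
        if needs_review:                  # an explicit branch, not a model decision
            pipeline.append("review")
        # Each stage would invoke an agent; here we just record the dispatch plan.
        return [f"{stage}: {task}" for stage in pipeline]
    ```

    The agents still do all the per-step reasoning inside each stage; they just never get to rewrite the pipeline itself.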

    Eval harness against end-to-end outcomes

    A golden test set of 50–500 representative inputs, scored against the user-visible outcome. Run on every PR. Multi-agent systems regress at the seams — only end-to-end evals catch this.
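    The harness itself can be this small — a sketch where `system` is the whole multi-agent pipeline as one callable, scored exact-match against a golden set (real scoring is usually fuzzier):

    ```python
    from typing import Callable

    def run_eval(
        system: Callable[[str], str],
        golden: list[tuple[str, str]],    # (input, expected user-visible output)
        threshold: float = 0.9,
    ) -> bool:
        # End-to-end: score the final output, not any intermediate agent's work.
        passed = sum(1 for inp, expected in golden if system(inp) == expected)
        return passed / len(golden) >= threshold
    ```

    Wire the boolean into CI so a regression at any seam fails the PR, regardless of which agent caused it.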

    AISD's default

    For new production engagements, AISD's default is single-loop ReAct with carefully designed tools. Most "multi-agent" workloads are better served by a single agent calling well-typed tools than by multiple agents passing free-form messages.

    We promote to plan-and-execute when the workload benefits from upfront strategy. We promote to routed sub-agents when sub-tasks are genuinely different. We promote to hierarchical only when parallelization is the metric mover. Debate-style multi-agent: rarely, on hard reasoning where quality matters more than cost.

    The discipline isn't picking the fanciest pattern — it's picking the simplest one that solves the problem. Vendors optimize for differentiation; production optimizes for the pager not going off at 3am.

    Adjacent reading: LLM orchestration patterns and agentic AI design patterns.

    Next step

    Ship a production multi-agent system.

    A 30-minute call gets you a fixed-price scope and an honest take on whether multi-agent is the right call for your use case.