Learn · How-to · AI Engineering
How to build an AI agent
Six steps. From goal definition to production deployment. Drawn from 30+ AI agents AISD has shipped to mid-market and enterprise customers in the last 18 months.
Updated · 2026-05-04 · 12 min read
Step 01
Define the goal and success metrics
Write down the specific outcome the agent must produce — and the measurable criteria. Auto-resolution rate, p95 latency, cost ceiling per session, escalation rate. If you can't write the metric in one sentence, you don't have a goal yet — keep iterating.
Agents without a measurable goal end up as demos. The metric anchors every later decision: which tools to expose, which orchestration pattern to pick, what the eval harness scores against, and when you're allowed to declare 'done.'
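One way to keep the metric honest is to encode it as data the eval harness can score against. A minimal sketch, assuming illustrative metric names and thresholds (not real customer targets):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SuccessCriteria:
    min_auto_resolution_rate: float  # fraction of sessions resolved with no human
    max_p95_latency_s: float         # 95th-percentile end-to-end latency, seconds
    max_cost_per_session_usd: float  # hard cost ceiling per session
    max_escalation_rate: float       # fraction of sessions escalated to a human

    def is_met(self, auto_resolution_rate: float, p95_latency_s: float,
               cost_per_session_usd: float, escalation_rate: float) -> bool:
        """True only when every metric is inside its threshold."""
        return (auto_resolution_rate >= self.min_auto_resolution_rate
                and p95_latency_s <= self.max_p95_latency_s
                and cost_per_session_usd <= self.max_cost_per_session_usd
                and escalation_rate <= self.max_escalation_rate)

# Hypothetical targets: 70% auto-resolution, p95 under 8s, $0.25/session, <10% escalation.
criteria = SuccessCriteria(0.70, 8.0, 0.25, 0.10)
```

If you can't fill in those four numbers, you're still in Step 01.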
Step 02
Map the tools the agent will need
List every API, database, and side-effecting action the agent must call. Define the typed schema for each. Identify which actions need human-in-the-loop approval. Each tool is an integration; this is where most of the build time goes.
Tool design is the single highest-leverage decision. A clean tool schema with typed inputs and explicit error states means the model can recover gracefully. A messy tool schema means the model invents arguments and your circuit breakers fire all day.
Step 03
Pick the orchestration pattern
Single-loop ReAct, plan-and-execute, or multi-agent graph (LangGraph). Pick for the actual problem, not for novelty. Most production agents are single-loop ReAct or plan-and-execute. Multi-agent is rarer than vendor marketing implies.
We default to single-loop for simple tools-and-decisions workflows. Plan-and-execute for tasks where the agent needs to outline before acting (research, multi-step writing). Multi-agent only when sub-tasks are fundamentally different and the cost of additional model calls is justified.
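The default single-loop shape is small enough to sketch in full. This is a generic ReAct skeleton with a stand-in `model` stub rather than a real LLM client; the action format and step budget are assumptions:

```python
def react_loop(model, tools: dict, user_input: str, max_steps: int = 8):
    """Single-loop ReAct: the model proposes a tool call or a final answer;
    the loop executes tools and feeds observations back."""
    history = [("user", user_input)]
    for _ in range(max_steps):
        action = model(history)  # returns {"type": "final"|"tool", ...}
        if action["type"] == "final":
            return action["answer"]
        result = tools[action["name"]](**action["args"])  # execute the tool
        history.append(("observation", result))           # feed result back
    return None  # step budget exhausted; escalate to a human in production

# Stub model: call the weather tool once, then answer from the observation.
def stub_model(history):
    if history[-1][0] == "observation":
        return {"type": "final", "answer": f"It is {history[-1][1]}."}
    return {"type": "tool", "name": "weather", "args": {"city": "Oslo"}}

answer = react_loop(stub_model, {"weather": lambda city: "4°C"}, "Weather in Oslo?")
```

Plan-and-execute adds an upfront planning call that produces the step list before this loop runs; multi-agent replaces the single loop with a graph of these loops, which is why its cost and failure surface both multiply.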
Step 04
Build the eval harness on day 1
A golden test set of 50–500 representative inputs scored automatically (model-graded) and by humans on a sample. Run on every PR. Without this you're shipping vibes — you'll hit production drift in week 4 with no way to measure it.
Eval-harness rigor is the difference between agents that survive and agents that die. Golden test set + automated scoring + a weekly human review of low-confidence cases. Score business metrics (resolution rate, accuracy on a labeled task), not just LLM-self-rated quality.
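The harness itself can be small; the golden set and the grader carry the value. A sketch, where `agent` and `grade` are stand-ins for your agent entry point and your scorer (exact-match here; swap in a model-graded rubric for open-ended outputs):

```python
def run_evals(agent, grade, golden_set, pass_floor=0.9):
    """Run the agent over the golden set; fail the build below the floor."""
    results = [grade(case["expected"], agent(case["input"])) for case in golden_set]
    pass_rate = sum(results) / len(results)
    return pass_rate, pass_rate >= pass_floor  # wire the bool into CI on every PR

# Toy golden set; a real one is 50-500 representative production inputs.
golden_set = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

# Exact-match grader for closed-form answers.
exact = lambda expected, actual: expected == actual
```

The weekly human review feeds this loop: every low-confidence production case you triage becomes a new entry in `golden_set`, so the harness tracks real traffic instead of launch-day assumptions.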
Step 05
Add guardrails
Input sanitization, output schema validation, prompt-injection adversarial test suite in CI, rate limits, per-session cost caps, circuit breakers on every tool call. Side-effecting actions gated by confidence thresholds.
Production agents fail in four ways: tool errors, prompt injection, cost spirals, distribution shift. Guardrails address each. The adversarial test suite is non-negotiable; if you can't break your own agent in CI, an attacker will break it in production.
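Two of those guardrails — the per-session cost cap and the tool circuit breaker — are simple enough to sketch. Thresholds below are illustrative; input sanitization, schema validation, and the adversarial suite sit alongside these, not inside them:

```python
class CostCapExceeded(Exception):
    pass

class SessionBudget:
    """Hard per-session spend ceiling; charge after every model call."""
    def __init__(self, cap_usd: float):
        self.cap_usd, self.spent = cap_usd, 0.0

    def charge(self, usd: float):
        self.spent += usd
        if self.spent > self.cap_usd:
            raise CostCapExceeded(f"spent ${self.spent:.2f} > cap ${self.cap_usd:.2f}")

class CircuitBreaker:
    """Open after `threshold` consecutive tool failures; then fail fast."""
    def __init__(self, fn, threshold: int = 3):
        self.fn, self.threshold, self.failures = fn, threshold, 0

    def call(self, *args, **kwargs):
        if self.failures >= self.threshold:
            raise RuntimeError("circuit open: tool disabled, escalate to a human")
        try:
            result = self.fn(*args, **kwargs)
            self.failures = 0  # any success resets the breaker
            return result
        except Exception:
            self.failures += 1
            raise
```

The breaker converts a flaky downstream API from an infinite retry loop (the cost-spiral failure mode) into a fast, visible escalation.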
Step 06
Deploy with observability and human-in-the-loop escalation
Every model call logged with cost, latency, tool-call success, and schema-validation result. Weekly review of low-confidence cases, fed back into the test set. Escalation path to a human when confidence is low or the action is consequential.
Agents in production are continuously monitored systems, not 'fire and forget' deploys. The weekly review ritual catches drift before it becomes an incident. Treat it the same way you treat oncall for any production system — because it is one.
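A minimal sketch of the per-call logging wrapper, assuming a generic `model_call` client and a `validate` function standing in for your output-schema check; the record fields mirror what the weekly review needs:

```python
import json
import time

def observed_call(model_call, validate, prompt: str, log: list):
    """Wrap one model call; emit a structured log line with cost, latency,
    schema status, and whether the result should escalate to a human."""
    t0 = time.monotonic()
    out = model_call(prompt)
    record = {
        "latency_s": round(time.monotonic() - t0, 3),
        "cost_usd": out.get("cost_usd", 0.0),
        "schema_ok": validate(out.get("text", "")),
    }
    record["escalate"] = not record["schema_ok"]  # low confidence -> human
    log.append(json.dumps(record))                # one JSON line per call
    return out, record

log: list[str] = []
out, rec = observed_call(
    lambda p: {"text": "ok", "cost_usd": 0.002},  # stub client
    lambda text: len(text) > 0,                   # stub schema check
    "hello", log)
```

The structured lines are what make the weekly review cheap: filter on `escalate`, triage, and promote the interesting cases into the golden test set.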
Common mistakes we see
- Building before defining the success metric. Without a metric, you can't tell if iteration is helping. Most agents that get killed in production were never measurable from week 1.
- Reaching for multi-agent. Most "multi-agent" deployments are actually one agent with well-designed tools. Multi-agent costs more, fails more, and is harder to evaluate.
- Skipping the eval harness. "We'll add evals later" is the same lie as "we'll add tests later." It does not get added later.
- Treating prompt injection as a bug bash. Adversarial inputs from users, scraped pages, and email bodies are a structural threat. Defense is architectural, not patch-by-patch.