Learn · AI Engineering
Eval harness for LLM apps
An eval harness is an automated test suite that continuously measures your LLM application's accuracy, reliability, cost, and latency. It is the difference between "it works in the demo" and "it works in production at scale." Every AISD engagement ships with one from day one.
Updated · 2026-05-02 · 8 min read
Five layers
The eval stack, from foundation up
A production eval harness has five layers. Each builds on the one below. Skip a layer and you'll have blind spots that only show up in production.
Layer 01
Unit evals (per-call accuracy)
Test individual LLM calls against labeled examples. Does the model return the right answer for this specific input? This is the foundation. Without unit evals, everything else is guessing.
Examples
- Classification accuracy on 200+ labeled samples
- Extraction precision/recall on structured fields
- Summarization ROUGE scores against human-written summaries
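A unit eval can be as small as a loop over labeled pairs. A minimal sketch, where `classify` is a stub standing in for your real LLM call and the labeled examples are illustrative:

```python
def classify(text: str) -> str:
    """Stub classifier; swap in a real LLM call in production."""
    return "refund" if "money back" in text.lower() else "other"

# A slice of a golden dataset: (input, expected label) pairs.
LABELED = [
    ("I want my money back", "refund"),
    ("Where is my order?", "other"),
    ("Please refund me, I need my money back", "refund"),
    ("How do I reset my password?", "other"),
]

def unit_eval(examples) -> float:
    """Fraction of examples where the model matches the label."""
    correct = sum(1 for text, label in examples if classify(text) == label)
    return correct / len(examples)

if __name__ == "__main__":
    print(f"accuracy: {unit_eval(LABELED):.2%}")
```

The same loop scales to the 200+ samples mentioned above; only the dataset and the model call change.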
Layer 02
Workflow evals (end-to-end correctness)
Test the full pipeline: retrieval + generation + tool calls + output formatting. A workflow eval catches failures that unit evals miss: wrong documents retrieved, tool calls in wrong order, final output format broken.
Examples
- Agent completes a 5-step task correctly
- RAG pipeline returns cited answers with correct sources
- Workflow produces valid JSON matching the schema
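One way to sketch a workflow eval is to run the whole pipeline and collect every failure rather than stopping at the first. `run_pipeline` and the `REQUIRED_FIELDS` contract below are illustrative assumptions, not a real API:

```python
import json

# Assumed output contract: field name -> expected Python type.
REQUIRED_FIELDS = {"answer": str, "sources": list}

def run_pipeline(question: str) -> str:
    """Stub for the full retrieval + generation + formatting workflow."""
    return json.dumps({"answer": "42", "sources": ["doc-7"]})

def workflow_eval(question: str) -> list[str]:
    """Return a list of failures; an empty list means the workflow passed."""
    try:
        out = json.loads(run_pipeline(question))
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    failures = []
    for field, ftype in REQUIRED_FIELDS.items():
        if field not in out:
            failures.append(f"missing field: {field}")
        elif not isinstance(out[field], ftype):
            failures.append(f"wrong type for field: {field}")
    if not out.get("sources"):
        failures.append("answer has no cited sources")
    return failures
```

Returning failures as a list, instead of a single pass/fail bit, makes eval reports far easier to debug when several things break at once.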
Layer 03
Regression evals (did we break something?)
Run the full eval suite on every code change and every model update. This is your CI gate. If a new prompt or model version drops accuracy below threshold, the deploy is blocked.
Examples
- Accuracy stays above 92% after prompt change
- Latency p95 stays under 3s after model swap
- No new failure modes in the top-20 edge cases
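The CI gate itself can be a small script that turns the eval report into an exit status, using the 92% accuracy and 3s p95 figures above. The threshold values here mirror the examples; in CI the measured numbers would come from your eval run:

```python
THRESHOLD_ACCURACY = 0.92   # from the regression examples above
THRESHOLD_P95_MS = 3000.0   # 3s latency budget, in milliseconds

def gate(accuracy: float, p95_ms: float) -> bool:
    """True when the change may ship; CI blocks the deploy otherwise."""
    return accuracy >= THRESHOLD_ACCURACY and p95_ms <= THRESHOLD_P95_MS

def main(accuracy: float, p95_ms: float) -> int:
    """Exit status for CI: 0 allows the deploy, non-zero blocks it."""
    return 0 if gate(accuracy, p95_ms) else 1
```

Wire it into the pipeline with `sys.exit(main(accuracy, p95_ms))` so the deploy step fails automatically when a prompt or model change regresses.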
Layer 04
Cost and latency tracking
Track tokens consumed, inference cost, and latency per call and per workflow. Cost drift is the silent killer of production LLM apps. A 10% prompt change can 3x your token usage.
Examples
- Cost per workflow run trends over time
- Latency percentiles (p50, p95, p99) per model
- Token budget alerts when usage exceeds threshold
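Per-call tracking can be sketched as a wrapper that records tokens and wall-clock time for every inference. The `(result, usage)` return shape and the in-memory `CALLS` list are assumptions for illustration; production code would write each record to a metrics store:

```python
import math
import time

CALLS = []  # in production, write each record to a metrics store instead

def tracked_call(model: str, fn, *args):
    """Wrap an inference call. `fn` must return (result, usage_dict);
    the {'total_tokens': ...} usage shape is an assumption."""
    start = time.perf_counter()
    result, usage = fn(*args)
    CALLS.append({
        "model": model,
        "tokens": usage.get("total_tokens", 0),
        "latency_ms": (time.perf_counter() - start) * 1000,
    })
    return result

def percentile(values, pct):
    """Nearest-rank percentile, e.g. pct=95 for p95."""
    ordered = sorted(values)
    k = max(0, math.ceil(len(ordered) * pct / 100) - 1)
    return ordered[k]
```

With every call logged, cost per workflow run and p50/p95/p99 latency fall out of simple aggregations over `CALLS`, which is exactly where token-budget alerts hook in.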
Layer 05
Drift detection (production monitoring)
Monitor live inputs and outputs for distribution shift. When real-world inputs diverge from your eval dataset, accuracy degrades silently. Drift detection catches this before users do.
Examples
- Input length distribution monitoring
- Output confidence score trends
- New intent clusters appearing in production traffic
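A simple drift check compares the live input-length distribution against the eval-set baseline. The z-score heuristic below is one illustrative option; production systems often use PSI or a Kolmogorov-Smirnov test instead:

```python
from statistics import mean, pstdev

def length_drift(baseline: list[int], live: list[int],
                 z_threshold: float = 3.0) -> bool:
    """Flag drift when the live mean input length deviates from the
    baseline mean by more than z_threshold baseline standard deviations.
    A deliberately simple heuristic, not a full statistical test."""
    mu, sigma = mean(baseline), pstdev(baseline)
    if sigma == 0:
        return mean(live) != mu
    return abs(mean(live) - mu) / sigma > z_threshold
```

Run the same comparison on a schedule over a sliding window of production traffic, and alert when it fires, so degradation surfaces before users report it.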
Getting started
Build your first eval in 4 hours
1. Collect 50-200 labeled examples from your domain experts. Golden datasets beat synthetic data.
2. Write unit evals that test your LLM calls against these labels. Measure accuracy, not vibes.
3. Wire the evals into CI so they run on every commit. Block deploys that drop below threshold.
4. Add cost and latency tracking. Log every inference call with tokens, model, and wall-clock time.
5. After 2 weeks in production, build drift detection from your real traffic distribution.
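The middle steps above condense into a single file that runs under pytest in CI. `classify` is again a stub for your real LLM call, and the 0.9 threshold is an assumed starting point, not a recommendation:

```python
# eval_test.py -- run with `pytest eval_test.py` on every commit.

GOLDEN = [  # tiny illustrative slice of a golden dataset
    ("refund my order", "refund"),
    ("track my package", "other"),
]

def classify(text: str) -> str:
    """Stub classifier; swap in your real LLM call."""
    return "refund" if "refund" in text else "other"

def accuracy(examples) -> float:
    hits = sum(classify(text) == label for text, label in examples)
    return hits / len(examples)

def test_accuracy_above_threshold():
    # A failing assertion fails the CI job, which blocks the deploy.
    assert accuracy(GOLDEN) >= 0.9
```

From here, growing the harness is incremental: more labels, more layers, same loop.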