
    Eval harness for LLM apps

    An eval harness is an automated test suite that continuously measures your LLM application's accuracy, reliability, cost, and latency. It is the difference between "it works in the demo" and "it works in production at scale." Every AISD engagement ships with one from day one.

    Updated · 2026-05-02 · 8 min read

    Five layers

    The eval stack, from foundation up

    A production eval harness has five layers. Each builds on the one below. Skip a layer and you'll have blind spots that only show up in production.

    Layer 01

    Unit evals (per-call accuracy)

    Test individual LLM calls against labeled examples. Does the model return the right answer for this specific input? This is the foundation. Without unit evals, everything else is guessing.

    Examples

    • Classification accuracy on 200+ labeled samples
    • Extraction precision/recall on structured fields
    • Summarization ROUGE scores against human-written summaries
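
    A minimal unit-eval sketch in Python, assuming a hypothetical classify_ticket function that wraps your LLM call and a tiny labeled set; in practice the examples come from your golden dataset and the stub becomes a real model call.

        # Unit eval sketch: score one LLM call against labeled examples.
        # `classify_ticket` is a stand-in for whatever wraps your model call.

        def classify_ticket(text: str) -> str:
            # Placeholder logic so the sketch runs; replace with your LLM call.
            return "billing" if "charged" in text.lower() else "bug"

        LABELED_EXAMPLES = [
            {"input": "I was charged twice this month", "label": "billing"},
            {"input": "The app crashes when I upload a file", "label": "bug"},
            # 200+ expert-labeled examples in practice
        ]

        def run_unit_eval(examples):
            failures = []
            for ex in examples:
                predicted = classify_ticket(ex["input"])
                if predicted != ex["label"]:
                    failures.append((ex["input"], ex["label"], predicted))
            accuracy = 1 - len(failures) / len(examples)
            return accuracy, failures

        accuracy, failures = run_unit_eval(LABELED_EXAMPLES)
        print(f"accuracy: {accuracy:.1%}, failures: {len(failures)}")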

    Layer 02

    Workflow evals (end-to-end correctness)

    Test the full pipeline: retrieval + generation + tool calls + output formatting. A workflow eval catches failures that unit evals miss: wrong documents retrieved, tool calls in the wrong order, a broken final output format.

    Examples

    • Agent completes a 5-step task correctly
    • RAG pipeline returns cited answers with correct sources
    • Workflow produces valid JSON matching the schema
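
    A sketch of a workflow eval in the same style, assuming a hypothetical run_rag_pipeline function standing in for the full retrieval-plus-generation pipeline; the checks cover both output shape and required citations.

        # Workflow eval sketch: exercise the whole pipeline, not a single call.
        def run_rag_pipeline(question: str) -> dict:
            # Placeholder so the sketch runs; in reality this retrieves documents,
            # calls the model, and returns a structured, cited answer.
            return {"answer": "Refunds are accepted within 30 days.", "sources": ["doc-42"]}

        WORKFLOW_CASES = [
            {"question": "What is our refund window?", "must_cite": ["doc-42"]},
        ]

        def eval_workflow(cases):
            passed = 0
            for case in cases:
                result = run_rag_pipeline(case["question"])
                # Shape check: downstream code depends on these fields existing.
                shape_ok = isinstance(result.get("answer"), str) and isinstance(result.get("sources"), list)
                # Citation check: every required source must actually be cited.
                cites_ok = all(src in result.get("sources", []) for src in case["must_cite"])
                passed += int(shape_ok and cites_ok)
            return passed / len(cases)

        print(f"workflow pass rate: {eval_workflow(WORKFLOW_CASES):.0%}")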

    Layer 03

    Regression evals (did we break something?)

    Run the full eval suite on every code change and every model update. This is your CI gate. If a new prompt or model version drops accuracy below threshold, the deploy is blocked.

    Examples

    • Accuracy stays above 92% after prompt change
    • Latency p95 stays under 3s after model swap
    • No new failure modes in the top-20 edge cases
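
    As a sketch, the CI gate can be a pytest test that reuses the unit-eval helper from the Layer 01 sketch and fails the build when accuracy drops; the evals.unit module path and the 92% threshold are illustrative.

        # Regression gate sketch: run in CI so a failing assertion blocks the deploy.
        # Assumes the Layer 01 helpers live in a module such as evals/unit.py.
        from evals.unit import LABELED_EXAMPLES, run_unit_eval

        ACCURACY_THRESHOLD = 0.92  # illustrative; set it from your own baseline

        def test_accuracy_does_not_regress():
            accuracy, failures = run_unit_eval(LABELED_EXAMPLES)
            assert accuracy >= ACCURACY_THRESHOLD, (
                f"accuracy {accuracy:.1%} fell below {ACCURACY_THRESHOLD:.0%}; "
                f"first failure: {failures[0] if failures else 'none'}"
            )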

    Layer 04

    Cost and latency tracking

    Track tokens consumed, inference cost, and latency per call and per workflow. Cost drift is the silent killer of production LLM apps. A 10% prompt change can 3x your token usage.

    Examples

    • Cost per workflow run trends over time
    • Latency percentiles (p50, p95, p99) per model
    • Token budget alerts when usage exceeds threshold
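
    A sketch of per-call logging, assuming placeholder per-token prices and a local JSONL file as the sink; substitute your provider's actual rates and your own metrics store.

        import json
        import time

        # Prices per 1K tokens are placeholders, not real provider rates.
        PRICE_PER_1K_TOKENS = {"example-model": {"input": 0.0005, "output": 0.0015}}

        def log_inference(model, input_tokens, output_tokens, started_at, path="llm_calls.jsonl"):
            prices = PRICE_PER_1K_TOKENS.get(model, {"input": 0.0, "output": 0.0})
            record = {
                "timestamp": started_at,
                "model": model,
                "input_tokens": input_tokens,
                "output_tokens": output_tokens,
                "latency_s": round(time.time() - started_at, 3),
                "cost_usd": round(
                    input_tokens / 1000 * prices["input"]
                    + output_tokens / 1000 * prices["output"], 6),
            }
            with open(path, "a") as f:
                f.write(json.dumps(record) + "\n")

        # Usage: record the start time, make the call, then log what it cost.
        start = time.time()
        # response = call_your_model(...)  # stand-in for the real inference call
        log_inference("example-model", input_tokens=850, output_tokens=220, started_at=start)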

    Layer 05

    Drift detection (production monitoring)

    Monitor live inputs and outputs for distribution shift. When real-world inputs diverge from your eval dataset, accuracy degrades silently. Drift detection catches this before users do.

    Examples

    • Input length distribution monitoring
    • Output confidence score trends
    • New intent clusters appearing in production traffic
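
    One way to sketch the input-length check in plain Python: compare recent production inputs against the eval dataset and alert when the mean or p95 length shifts beyond a tolerance (the 30% figure and the sample lengths are illustrative).

        def percentile(values, q):
            # Nearest-rank percentile; good enough for a monitoring sketch.
            ordered = sorted(values)
            idx = min(int(q * (len(ordered) - 1)), len(ordered) - 1)
            return ordered[idx]

        def input_length_drift(reference_lengths, production_lengths, tolerance=0.3):
            # Flag drift when mean or p95 input length moves more than `tolerance`.
            alerts = []
            stats = [("mean", lambda xs: sum(xs) / len(xs)),
                     ("p95", lambda xs: percentile(xs, 0.95))]
            for name, stat in stats:
                ref, live = stat(reference_lengths), stat(production_lengths)
                if ref and abs(live - ref) / ref > tolerance:
                    alerts.append(f"{name} input length shifted from {ref:.0f} to {live:.0f}")
            return alerts

        # Example: lengths from the golden dataset vs. the last batch of production requests.
        print(input_length_drift([120, 140, 110, 130], [260, 300, 280, 240]))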

    Getting started

    Build your first eval in 4 hours

    1. Collect 50-200 labeled examples from your domain experts. Golden datasets beat synthetic data (a format sketch follows this list).
    2. Write unit evals that test your LLM calls against these labels. Measure accuracy, not vibes.
    3. Wire the evals into CI so they run on every commit. Block deploys that drop below threshold.
    4. Add cost and latency tracking. Log every inference call with tokens, model, and wall-clock time.
    5. After 2 weeks in production, build drift detection from your real traffic distribution.
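
    For step 1, a sketch of a golden-dataset loader, assuming one expert-labeled example per line in a hypothetical examples.jsonl file.

        import json

        # Each line is one labeled example, e.g.
        # {"input": "I was charged twice this month", "label": "billing", "labeler": "expert-review"}

        def load_golden_dataset(path="examples.jsonl"):
            examples = []
            with open(path) as f:
                for line_no, line in enumerate(f, start=1):
                    ex = json.loads(line)
                    # Fail fast on malformed rows so bad labels never reach the evals.
                    if "input" not in ex or "label" not in ex:
                        raise ValueError(f"line {line_no} is missing input/label")
                    examples.append(ex)
            return examples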

    Next step

    Ship AI that's measurably reliable.

    Every AISD engagement includes an eval harness. Talk to us about building one for your workload.