
    Cost of LLM inference in 2026

    Pricing has fallen 10× in two years. Frontier-quality output that cost $30 per million tokens in 2024 is now under $3. But production AI bills are getting larger, not smaller, because volume is scaling faster than per-token costs are dropping. This article covers current pricing across providers, the self-host break-even point, and the five levers that actually move a bill at scale.

    Updated · 2026-05-03 · 9 min read · Pricing reflects published rates; negotiated enterprise pricing varies.

    Frontier API pricing

    Prices in USD per 1 million tokens. "Cache" is the cached input rate (where supported); the discount typically applies to system prompts and tool definitions repeated across a session.

    | Provider  | Model            | Input  | Output | Cache | When to use                                     |
    |-----------|------------------|--------|--------|-------|-------------------------------------------------|
    | Anthropic | Claude Opus 4    | $15.00 | $75.00 | $1.50 | Highest-reasoning frontier; agent-tier accuracy |
    | Anthropic | Claude Sonnet 4  | $3.00  | $15.00 | $0.30 | Default for most production agent loops         |
    | Anthropic | Claude Haiku 4   | $0.80  | $4.00  | $0.08 | Lightweight tools + classification              |
    | OpenAI    | GPT-5 (frontier) | $5.00  | $20.00 | $0.50 | Frontier reasoning, broad ecosystem             |
    | OpenAI    | GPT-5 mini       | $1.25  | $5.00  | $0.13 | Volume workloads, fast                          |
    | Google    | Gemini 2.5 Pro   | $2.50  | $10.00 | $0.31 | Long context (2M tokens)                        |
    | Google    | Gemini 2.5 Flash | $0.30  | $1.20  | $0.04 | Cheap + fast for high-volume                    |

    Two patterns stand out in this table. First: output tokens cost 4–5× input tokens at every provider, so controlling output length is the highest-leverage cost decision in your stack. Second: cached input is roughly 90% off at Anthropic and Google. If your system prompts are stable, you should be using prompt caching; it's free money.
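    To see how these rates combine per request, here's a back-of-envelope helper (a sketch: the rates are hardcoded from the Sonnet 4 row above, and the cached-prefix size is an assumption you'd measure in production):

```python
def request_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0,
                 in_rate: float = 3.00, out_rate: float = 15.00,
                 cache_rate: float = 0.30) -> float:
    """Cost of one call in USD. Rates are $/1M tokens (Claude Sonnet 4 row)."""
    uncached = input_tokens - cached_tokens
    return (uncached * in_rate
            + cached_tokens * cache_rate
            + output_tokens * out_rate) / 1_000_000

# A typical agent turn: 8k prompt (6k of it a stable, cacheable system prefix)
# and 1k of output. Output is ~2/3 of the bill despite being 1/9 of the tokens.
print(request_cost(8_000, 1_000, cached_tokens=6_000))  # ~$0.0228
print(request_cost(8_000, 1_000))                       # ~$0.0390 uncached
```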

    Open-weight pricing

    Open-weight models (Llama, Mistral, Qwen) are increasingly competitive on quality and dramatically cheaper at volume. You can run them through hosted inference platforms (Together, Fireworks, Anyscale, Groq) or self-host on dedicated GPUs.

    | Model          | Host               | Input  | Output | Notes                                                                    |
    |----------------|--------------------|--------|--------|--------------------------------------------------------------------------|
    | Llama 3.3 70B  | Together AI        | $0.88  | $0.88  | Self-host break-even ~5M tokens/day                                      |
    | Llama 3.1 405B | Fireworks          | $3.00  | $3.00  | Frontier-adjacent quality; dedicated GPU recommended for stable latency  |
    | Mistral Large 2| Mistral API        | $2.00  | $6.00  | Strong tool use; EU-hosted option                                        |
    | Qwen 3 72B     | Self-hosted (H100) | ~$0.40 | ~$0.40 | At $4–6/hr per GPU; depends on utilization                               |

    Self-host break-even

    A dedicated H100 GPU runs $3–6/hour at most cloud providers. Serving Llama 3.3 70B with vLLM, a single request stream generates roughly 50 output tokens/second, or ~180k output tokens per hour. At Together AI's $0.88/M rate, that hour of single-stream output is worth about $0.16 in API billing. Self-hosting only pays off when continuous batching keeps the GPU saturated with many concurrent requests, pushing aggregate throughput far above the single-stream rate.

    Rule of thumb: self-hosting on a single GPU breaks even around 5–10 million tokens per day per GPU. Below that, hosted APIs (Together, Fireworks) are cheaper because of utilization economics. Above that, dedicated infrastructure is competitive — and gives you compliance, latency, and data sovereignty wins on top.
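    To make the utilization point concrete, here's a sketch of a dedicated GPU's effective per-token cost (the throughput figures are assumptions to benchmark for your model and hardware, not guarantees):

```python
def self_host_cost_per_m_tokens(gpu_usd_per_hour: float,
                                tokens_per_second: float) -> float:
    """Effective $/1M tokens for a dedicated GPU at a given aggregate throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_usd_per_hour / tokens_per_hour * 1_000_000

# Single stream (~50 tok/s) on a $4/hr H100: ~$22/M -- far above hosted rates.
print(self_host_cost_per_m_tokens(4.0, 50))     # ~22.22
# Continuous batching at ~1,500 tok/s aggregate (assumed; benchmark it):
# ~$0.74/M, which undercuts Together's $0.88/M.
print(self_host_cost_per_m_tokens(4.0, 1_500))  # ~0.74
```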

    For most AISD engagements, the cost calculus matters less than the architectural one: data residency, audit requirements, and SLA control are the actual reasons to self-host.

    Five cost levers that actually move bills

    These are the levers we apply on every production AISD engagement. In aggregate they typically cut a naive implementation's bill by 60–95%.

    Prompt caching

    30–90% cost reduction

    If your system prompts and tool definitions are stable across requests, cache them. Anthropic and Google offer 90% discounts on cached input tokens. OpenAI offers automatic prefix caching at 50–75% off. The math compounds at scale.
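    A minimal sketch with the Anthropic Python SDK (the model id is a placeholder; check current docs for exact ids and caching eligibility):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

STABLE_SYSTEM_PROMPT = "..."  # instructions + tool docs reused on every call

response = client.messages.create(
    model="claude-sonnet-4",  # placeholder model id
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": STABLE_SYSTEM_PROMPT,
            # Mark the stable prefix as cacheable; later calls that reuse
            # this exact prefix are billed at the cached-input rate.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize this ticket: ..."}],
)
```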

    Model routing

    40–70% cost reduction

    Most queries don't need the frontier model. Route easy classifications to Haiku/mini/Flash; reserve Opus/GPT-5 for complex reasoning. A two-tier router with a small classifier upstream often pays for itself in days.
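    One way to wire this up (a sketch: the model ids, triage prompt, and HARD/EASY labels are all illustrative):

```python
import anthropic

client = anthropic.Anthropic()

CHEAP, FRONTIER = "claude-haiku-4", "claude-opus-4"  # placeholder model ids

def pick_model(query: str) -> str:
    """A cheap classifier decides whether the query needs the frontier model."""
    triage = client.messages.create(
        model=CHEAP,
        max_tokens=5,
        system="Reply HARD if the request needs multi-step reasoning, else EASY.",
        messages=[{"role": "user", "content": query}],
    )
    label = triage.content[0].text.strip().upper()
    return FRONTIER if label.startswith("HARD") else CHEAP
```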

    Output token discipline

    20–50% cost reduction

    Output tokens cost 4–5× input tokens at most providers. Constrain outputs with structured schemas. Avoid 'think out loud' patterns that bill verbose reasoning at output-token rates. Use thinking budgets where models support them.
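    For example (a sketch: the schema, token cap, and model id are illustrative; tune them per task):

```python
import json

import anthropic

client = anthropic.Anthropic()

resp = client.messages.create(
    model="claude-haiku-4",  # placeholder model id
    max_tokens=100,          # hard cap: you can't be billed past what you allow
    system=('Reply with JSON only, no prose: '
            '{"label": "<category>", "confidence": <0..1>}'),
    messages=[{"role": "user", "content": "Classify: 'refund not received'"}],
)
result = json.loads(resp.content[0].text)  # fails fast if the model rambles
```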

    Batch APIs

    50% cost reduction

    OpenAI and Anthropic offer 50%-off batch processing with 24-hour SLAs. Backfills, evaluations, classification at rest, and ETL augmentation are perfect candidates. Move them off the synchronous path.
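    The OpenAI flow looks roughly like this (a sketch: requests.jsonl holds one request per line in the batch JSONL shape; the file name and request body are illustrative):

```python
from openai import OpenAI

client = OpenAI()

# requests.jsonl: one JSON object per line, e.g.
# {"custom_id": "row-1", "method": "POST", "url": "/v1/chat/completions",
#  "body": {"model": "gpt-5-mini", "messages": [...]}}
batch_file = client.files.create(file=open("requests.jsonl", "rb"),
                                 purpose="batch")
batch = client.batches.create(input_file_id=batch_file.id,
                              endpoint="/v1/chat/completions",
                              completion_window="24h")
print(batch.id, batch.status)  # poll later; results arrive within the 24h SLA
```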

    Embedding-first retrieval

    60–95% cost reduction (vs. long-context loading)

    Don't stuff 100k tokens of docs into every prompt. Embed once, retrieve top-K. Embedding cost is $0.02–$0.13 per 1M tokens — orders of magnitude cheaper than re-feeding context every call.
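    A minimal sketch of the embed-once, retrieve-top-K pattern (the corpus, chunking, and k are placeholders to tune for your data):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    vecs = np.array([d.embedding for d in resp.data])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize

docs = ["...doc chunk 1...", "...doc chunk 2..."]  # chunked corpus
doc_vecs = embed(docs)                             # embed once, store forever

def top_k(query: str, k: int = 5) -> list[str]:
    q = embed([query])[0]
    scores = doc_vecs @ q                 # cosine similarity on unit vectors
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]
```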

    Picking a default

    For greenfield production agents, AISD's default is Claude Sonnet 4 with prompt caching for the planning + tool-calling loop, and Haiku 4 or Gemini 2.5 Flash for classification, routing, and short responses. Reserve Opus 4 or GPT-5 frontier for the hard reasoning steps where the quality differential measurably moves your metric.

    Open-weight models earn their place in three scenarios: very high volume (per-call cost dominates), strict latency targets (a dedicated GPU gives predictable first-token latency), and data sovereignty requirements (the model never leaves your perimeter). Outside those, frontier APIs win on quality, ecosystem, and developer velocity.

    Want a real cost estimate for your specific use case? Use our interactive ROI calculator →

    Next step

    Cost-optimize a production AI workload.

    A 30-minute call gets you a sized estimate against your actual usage profile and 3 quick-win recommendations.