Cost of LLM inference in 2026
Pricing has fallen 10× in two years. Frontier-quality output that cost $30 per million tokens in 2024 is now under $3. But production AI bills are getting larger, not smaller, because volume is scaling faster than per-token costs are dropping. This article covers current pricing across providers, the self-host break-even point, and the five levers that actually move a bill at scale.
Updated · 2026-05-03 · 9 min read · Pricing reflects published rates; negotiated enterprise pricing varies.
Frontier API pricing
Prices in USD per 1 million tokens. "Cache" is the cached input rate (where supported); the discount typically applies to system prompts and tool definitions repeated across a session.
| Provider | Model | Input | Output | Cache | When to use |
|---|---|---|---|---|---|
| Anthropic | Claude Opus 4 | $15.00 | $75.00 | $1.50 | Highest-reasoning frontier; agent-tier accuracy |
| Anthropic | Claude Sonnet 4 | $3.00 | $15.00 | $0.30 | Default for most production agent loops |
| Anthropic | Claude Haiku 4 | $0.80 | $4.00 | $0.08 | Lightweight tools + classification |
| OpenAI | GPT-5 (frontier) | $5.00 | $20.00 | $0.50 | Frontier reasoning, broad ecosystem |
| OpenAI | GPT-5 mini | $1.25 | $5.00 | $0.13 | Volume workloads, fast |
| Google | Gemini 2.5 Pro | $2.50 | $10.00 | $0.31 | Long context (2M tokens) |
| Google | Gemini 2.5 Flash | $0.30 | $1.20 | $0.04 | Cheap + fast for high-volume |
Two patterns in this table. First: output tokens cost 4–5× input tokens at every provider here, so controlling output length is the highest-leverage cost decision in your stack. Second: cached input is roughly 90% off at Anthropic and Google. If your system prompts are stable, you should be using prompt caching; it's free money.
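To see both effects in one number, here's a back-of-envelope cost function. The rates come from the table above; the request shape (an 8k-token prompt, 6k of it a stable cached prefix, 1k of output) is an illustrative assumption, not a benchmark.

```python
def request_cost(input_tok, output_tok, cached_tok, rate_in, rate_out, rate_cache):
    """USD for one request; rates are USD per 1M tokens (from the table above)."""
    uncached = input_tok - cached_tok
    return (uncached * rate_in + cached_tok * rate_cache + output_tok * rate_out) / 1e6

# Claude Sonnet 4: 8k-token prompt, 6k of it a cached system prompt + tools, 1k output.
with_cache = request_cost(8_000, 1_000, 6_000, rate_in=3.00, rate_out=15.00, rate_cache=0.30)
no_cache   = request_cost(8_000, 1_000, 0,     rate_in=3.00, rate_out=15.00, rate_cache=0.30)
print(f"${with_cache:.4f} vs ${no_cache:.4f}")  # $0.0228 vs $0.0390, ~42% saved
# Even with caching, the 1k output tokens ($0.0150) are two-thirds of the bill.
```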
Open-weight pricing
Open-weight models (Llama, Mistral, Qwen) are increasingly competitive on quality and dramatically cheaper at volume. Hosted by inference platforms (Together, Fireworks, Anyscale, Groq) or self-hosted on dedicated GPUs.
| Model | Host | Input | Output | Notes |
|---|---|---|---|---|
| Llama 3.3 70B | Together AI | $0.88 | $0.88 | Self-host break-even ~5M tokens/day |
| Llama 3.1 405B | Fireworks | $3.00 | $3.00 | Frontier-adjacent quality, dedicated GPU recommended for stable latency |
| Mistral Large 2 | Mistral API | $2.00 | $6.00 | Strong tool-use; EU-hosted option |
| Qwen 3 72B | Self-hosted (H100) | ~$0.40 | ~$0.40 | Self-hosted at $4–6/hr per GPU; depends on utilization |
Self-host break-even
A dedicated H100 GPU runs $3–6/hour at most cloud providers. Llama 3.3 70B served single-stream with vLLM (~50 tok/s output) produces ~180k output tokens per hour, worth just $0.16 at Together AI's $0.88/M rate. That single-stream figure understates the hardware, though: continuous batching serves many requests concurrently, so aggregate throughput, and the hosted billing it displaces, scales with load. Self-hosting pays off only at sustained high utilization.
Rule of thumb: self-hosting breaks even around 5–10 million tokens per day per GPU. Below that, hosted APIs (Together, Fireworks) are cheaper because of utilization economics. Above that, dedicated infrastructure is competitive, with compliance, latency, and data-sovereignty wins on top.
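The arithmetic behind the rule of thumb, as a sketch you can rerun with your own numbers. Note which hosted rate you're displacing: against frontier output pricing the break-even lands in the 5–10M/day range, while against an already-cheap open-weight host it sits far higher.

```python
def break_even_tokens_per_day(gpu_usd_per_hr: float, hosted_usd_per_m: float) -> float:
    """Daily token volume at which a dedicated GPU costs the same as hosted billing."""
    return gpu_usd_per_hr * 24 / hosted_usd_per_m * 1_000_000

# Displacing frontier output billing ($15/M) with GPUs at $3-6/hr:
print(f"{break_even_tokens_per_day(3.00, 15.00):,.0f}")  # 4,800,000 tokens/day
print(f"{break_even_tokens_per_day(6.00, 15.00):,.0f}")  # 9,600,000 tokens/day

# Displacing a cheap open-weight host ($0.88/M) instead:
print(f"{break_even_tokens_per_day(4.00, 0.88):,.0f}")   # ~109,090,909 tokens/day
```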
For most AISD engagements, the cost calculus matters less than the architectural one: data residency, audit requirements, and SLA control are the actual reasons to self-host.
Five cost levers that actually move bills
These are the cost reductions we apply on every production AISD engagement. In aggregate they typically drop a naive implementation's bill by 60–95%.
Prompt caching
30–90% cost reduction
If your system prompts and tool definitions are stable across requests, cache them. Anthropic and Google offer 90% discounts on cached input tokens. OpenAI offers automatic prefix caching at 50–75% off. The math compounds at scale.
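A minimal sketch with Anthropic's Python SDK; the model ID and prompt are illustrative placeholders. (OpenAI's prefix caching needs no annotation; it applies automatically to repeated prompt prefixes.)

```python
import anthropic

STABLE_SYSTEM_PROMPT = "You are a support agent for ..."  # long, identical across requests

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4",  # illustrative ID; check your provider's current model list
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": STABLE_SYSTEM_PROMPT,
            # Everything up to this block becomes a cacheable prefix; requests that
            # reuse the identical prefix are billed at the cached-input rate.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Where is my order?"}],
)
print(response.usage)  # exposes cache_creation_input_tokens / cache_read_input_tokens
```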
Model routing
40–70% cost reduction
Most queries don't need the frontier model. Route easy classifications to Haiku/mini/Flash; reserve Opus/GPT-5 for complex reasoning. A two-tier router with a small classifier upstream often pays for itself in days.
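A sketch of that shape. The keyword heuristic here is a hypothetical stand-in for the small upstream classifier; in production that slot is usually a Haiku/Flash call or a fine-tuned lightweight classifier.

```python
def route(query: str) -> str:
    """Return the cheapest model tier that can plausibly handle the query."""
    hard_signals = ("why", "plan", "compare", "debug", "step by step")
    if len(query) > 2_000 or any(s in query.lower() for s in hard_signals):
        return "claude-opus-4"   # frontier tier: reserved for complex reasoning
    return "claude-haiku-4"      # cheap tier: classification, lookups, short answers

model = route("Is this email spam? Subject: You won a prize")  # -> "claude-haiku-4"
```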
Output token discipline
20–50% cost reduction
Output tokens are 3–5× the price of input tokens at most providers. Constrain outputs with structured schemas and explicit token ceilings. Avoid "think out loud" patterns that bill verbose reasoning at output rates. Use thinking budgets where models support them.
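In practice that means a hard token ceiling plus a schema with enums, so the model has nowhere to ramble. A sketch assuming OpenAI's structured-output API; the model ID and schema are illustrative.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="gpt-5-mini",          # illustrative ID
    max_completion_tokens=150,   # hard ceiling: billed output cannot exceed this
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "ticket_triage",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "category": {"type": "string", "enum": ["billing", "bug", "other"]},
                    "priority": {"type": "integer", "enum": [1, 2, 3]},
                },
                "required": ["category", "priority"],
                "additionalProperties": False,
            },
        },
    },
    messages=[{"role": "user", "content": "Triage this ticket: app crashes on login"}],
)
print(resp.choices[0].message.content)  # e.g. {"category": "bug", "priority": 1}
```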
Batch APIs
50% cost reduction
OpenAI and Anthropic offer 50%-off batch processing with 24-hour SLAs. Backfills, evaluations, classification at rest, and ETL augmentation are perfect candidates. Move them off the synchronous path.
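A sketch of the OpenAI batch flow for a classification backfill; the corpus, prompts, and model ID are illustrative. Anthropic's Message Batches API follows the same submit-then-poll shape.

```python
import json
from openai import OpenAI

client = OpenAI()

# One JSONL line per request; custom_id lets you join results back afterwards.
with open("backfill.jsonl", "w") as f:
    for i, doc in enumerate(["doc one ...", "doc two ..."]):  # illustrative corpus
        f.write(json.dumps({
            "custom_id": f"doc-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-5-mini",  # illustrative ID
                "messages": [{"role": "user", "content": f"Classify: {doc}"}],
                "max_tokens": 20,
            },
        }) + "\n")

batch_file = client.files.create(file=open("backfill.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # the 24-hour window that buys the 50% discount
)
print(batch.id, batch.status)  # poll until "completed", then download the output file
```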
Embedding-first retrieval
60–95% cost reduction (vs. long-context loading)
Don't stuff 100k tokens of docs into every prompt. Embed once, retrieve top-K. Embedding cost is $0.02–$0.13 per 1M tokens — orders of magnitude cheaper than re-feeding context every call.
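A minimal embed-once, retrieve-top-K sketch; the three-document corpus is illustrative, and text-embedding-3-small sits at the cheap end of the range quoted above.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Embed the corpus once, offline -- not on every request.
docs = ["refund policy: ...", "shipping times: ...", "api rate limits: ..."]
doc_vecs = embed(docs)
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed([query])[0]
    q /= np.linalg.norm(q)
    top = np.argsort(doc_vecs @ q)[::-1][:k]  # cosine similarity, top-K
    return [docs[i] for i in top]

# Only the top-K chunks enter the prompt, not the whole corpus.
context = "\n\n".join(retrieve("how do refunds work?"))
```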
Picking a default
For greenfield production agents, AISD's default is Claude Sonnet 4 with prompt caching for the planning + tool-calling loop, and Haiku 4 or Gemini 2.5 Flash for classification, routing, and short responses. Reserve Opus 4 or GPT-5 for the hard reasoning steps where the output-quality differential measurably moves your target metric.
Open-weight models earn their place in three scenarios: very high volume (per-call cost dominates), strict latency targets (a dedicated GPU gives you control over first-token latency), and data sovereignty requirements (the model never leaves your perimeter). Outside those, frontier APIs win on quality, ecosystem, and developer velocity.
Want a real cost estimate for your specific use case? Use our interactive ROI calculator →