Cost of LLM inference in 2026
Pricing has fallen 10× in two years. Frontier-quality output that cost $30 per million tokens in 2024 is now under $3. But production AI bills are getting larger, not smaller, because volume is scaling faster than per-token costs are dropping. This article covers current pricing across providers, the self-host break-even point, and the five levers that actually move a bill at scale.
Updated · 2026-05-03 · 9 min read · Pricing reflects published rates; negotiated enterprise pricing varies.
Frontier API pricing
Prices in USD per 1 million tokens. "Cache" is the cached input rate (where supported); the discount typically applies to system prompts and tool definitions repeated across a session.
| Provider | Model | Input | Output | Cache | When to use |
|---|---|---|---|---|---|
| Anthropic | Claude Opus 4 | $15.00 | $75.00 | $1.50 | Highest-reasoning frontier; agent-tier accuracy |
| Anthropic | Claude Sonnet 4 | $3.00 | $15.00 | $0.30 | Default for most production agent loops |
| Anthropic | Claude Haiku 4 | $0.80 | $4.00 | $0.08 | Lightweight tools + classification |
| OpenAI | GPT-5 (frontier) | $5.00 | $20.00 | $0.50 | Frontier reasoning, broad ecosystem |
| OpenAI | GPT-5 mini | $1.25 | $5.00 | $0.13 | Volume workloads, fast |
| Google | Gemini 2.5 Pro | $2.50 | $10.00 | $0.31 | Long context (2M tokens) |
| Google | Gemini 2.5 Flash | $0.30 | $1.20 | $0.04 | Cheap + fast for high-volume |
Two patterns in this table. First: output tokens cost 4–5× input tokens at every provider here, so controlling output length is the highest-leverage cost decision in your stack. Second: cached input is roughly 90% off at Anthropic and Google. If your system prompts are stable, you should be using prompt caching; it's free money.
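To see both effects in one number, here's a back-of-envelope cost function. The rates come from the table above; the request shape (an 8k-token prompt, 6k of it a stable cached prefix, 1k of output) is an illustrative assumption, not a benchmark.

```python
def request_cost(input_tok, output_tok, cached_tok, rate_in, rate_out, rate_cache):
    """USD for one request; rates are USD per 1M tokens (from the table above)."""
    uncached = input_tok - cached_tok
    return (uncached * rate_in + cached_tok * rate_cache + output_tok * rate_out) / 1e6

# Claude Sonnet 4: 8k-token prompt, 6k of it a cached system prompt + tools, 1k output.
with_cache = request_cost(8_000, 1_000, 6_000, rate_in=3.00, rate_out=15.00, rate_cache=0.30)
no_cache   = request_cost(8_000, 1_000, 0,     rate_in=3.00, rate_out=15.00, rate_cache=0.30)
print(f"${with_cache:.4f} vs ${no_cache:.4f}")  # $0.0228 vs $0.0390, ~42% saved
# Even with caching, the 1k output tokens ($0.0150) are two-thirds of the bill.
```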
Open-weight pricing
Open-weight models (Llama, Mistral, Qwen) are increasingly competitive on quality and dramatically cheaper at volume. Hosted by inference platforms (Together, Fireworks, Anyscale, Groq) or self-hosted on dedicated GPUs.
| Model | Host | Input | Output | Notes |
|---|---|---|---|---|
| Llama 3.3 70B | Together AI | $0.88 | $0.88 | Self-host break-even ~5M tokens/day |
| Llama 3.1 405B | Fireworks | $3.00 | $3.00 | Frontier-adjacent quality, dedicated GPU recommended for stable latency |
| Mistral Large 2 | Mistral API | $2.00 | $6.00 | Strong tool-use; EU-hosted option |
| Qwen 3 72B | Self-hosted (H100) | ~$0.40 | ~$0.40 | Self-hosted at $4–6/hr per GPU; depends on utilization |
Self-host break-even
A dedicated H100 GPU runs $3–6/hour at most cloud providers. Llama 3.3 70B served single-stream with vLLM (~50 tok/s output) produces ~180k output tokens per hour, worth just $0.16 at Together AI's $0.88/M rate. That single-stream figure understates the hardware, though: continuous batching serves many requests concurrently, so aggregate throughput, and the hosted billing it displaces, scales with load. Self-hosting pays off only at sustained high utilization.
Rule of thumb: self-hosting breaks even around 5–10 million tokens per day per GPU. Below that, hosted APIs (Together, Fireworks) are cheaper because of utilization economics. Above that, dedicated infrastructure is competitive, with compliance, latency, and data-sovereignty wins on top.
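The arithmetic behind the rule of thumb, as a sketch you can rerun with your own numbers. Note which hosted rate you're displacing: against frontier output pricing the break-even lands in the 5–10M/day range, while against an already-cheap open-weight host it sits far higher.

```python
def break_even_tokens_per_day(gpu_usd_per_hr: float, hosted_usd_per_m: float) -> float:
    """Daily token volume at which a dedicated GPU costs the same as hosted billing."""
    return gpu_usd_per_hr * 24 / hosted_usd_per_m * 1_000_000

# Displacing frontier output billing ($15/M) with GPUs at $3-6/hr:
print(f"{break_even_tokens_per_day(3.00, 15.00):,.0f}")  # 4,800,000 tokens/day
print(f"{break_even_tokens_per_day(6.00, 15.00):,.0f}")  # 9,600,000 tokens/day

# Displacing a cheap open-weight host ($0.88/M) instead:
print(f"{break_even_tokens_per_day(4.00, 0.88):,.0f}")   # ~109,090,909 tokens/day
```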
For most AISD engagements, the cost calculus matters less than the architectural one: data residency, audit requirements, and SLA control are the actual reasons to self-host.
Five cost levers that actually move bills
These are the cost reductions we apply on every production AISD engagement. In aggregate they typically drop a naive implementation's bill by 60–95%.
Prompt caching
30–90% cost reduction
If your system prompts and tool definitions are stable across requests, cache them. Anthropic and Google offer 90% discounts on cached input tokens. OpenAI offers automatic prefix caching at 50–75% off. The math compounds at scale.
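A minimal sketch with Anthropic's Python SDK; the model ID and prompt are illustrative placeholders. (OpenAI's prefix caching needs no annotation; it applies automatically to repeated prompt prefixes.)

```python
import anthropic

STABLE_SYSTEM_PROMPT = "You are a support agent for ..."  # long, identical across requests

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4",  # illustrative ID; check your provider's current model list
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": STABLE_SYSTEM_PROMPT,
            # Everything up to this block becomes a cacheable prefix; requests that
            # reuse the identical prefix are billed at the cached-input rate.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Where is my order?"}],
)
print(response.usage)  # exposes cache_creation_input_tokens / cache_read_input_tokens
```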
Model routing
40–70% cost reduction
Most queries don't need the frontier model. Route easy classifications to Haiku/mini/Flash; reserve Opus/GPT-5 for complex reasoning. A two-tier router with a small classifier upstream often pays for itself in days.
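A sketch of that shape. The keyword heuristic here is a hypothetical stand-in for the small upstream classifier; in production that slot is usually a Haiku/Flash call or a fine-tuned lightweight classifier.

```python
def route(query: str) -> str:
    """Return the cheapest model tier that can plausibly handle the query."""
    hard_signals = ("why", "plan", "compare", "debug", "step by step")
    if len(query) > 2_000 or any(s in query.lower() for s in hard_signals):
        return "claude-opus-4"   # frontier tier: reserved for complex reasoning
    return "claude-haiku-4"      # cheap tier: classification, lookups, short answers

model = route("Is this email spam? Subject: You won a prize")  # -> "claude-haiku-4"
```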
Output token discipline
20–50% cost reduction
Output tokens are 3–5× the price of input tokens at most providers. Constrain outputs with structured schemas and explicit token ceilings. Avoid "think out loud" patterns that bill verbose reasoning at output rates. Use thinking budgets where models support them.
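In practice that means a hard token ceiling plus a schema with enums, so the model has nowhere to ramble. A sketch assuming OpenAI's structured-output API; the model ID and schema are illustrative.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="gpt-5-mini",          # illustrative ID
    max_completion_tokens=150,   # hard ceiling: billed output cannot exceed this
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "ticket_triage",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "category": {"type": "string", "enum": ["billing", "bug", "other"]},
                    "priority": {"type": "integer", "enum": [1, 2, 3]},
                },
                "required": ["category", "priority"],
                "additionalProperties": False,
            },
        },
    },
    messages=[{"role": "user", "content": "Triage this ticket: app crashes on login"}],
)
print(resp.choices[0].message.content)  # e.g. {"category": "bug", "priority": 1}
```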
Batch APIs
50% cost reduction
OpenAI and Anthropic offer 50%-off batch processing with 24-hour SLAs. Backfills, evaluations, classification at rest, and ETL augmentation are perfect candidates. Move them off the synchronous path.
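A sketch of the OpenAI batch flow for a classification backfill; the corpus, prompts, and model ID are illustrative. Anthropic's Message Batches API follows the same submit-then-poll shape.

```python
import json
from openai import OpenAI

client = OpenAI()

# One JSONL line per request; custom_id lets you join results back afterwards.
with open("backfill.jsonl", "w") as f:
    for i, doc in enumerate(["doc one ...", "doc two ..."]):  # illustrative corpus
        f.write(json.dumps({
            "custom_id": f"doc-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-5-mini",  # illustrative ID
                "messages": [{"role": "user", "content": f"Classify: {doc}"}],
                "max_tokens": 20,
            },
        }) + "\n")

batch_file = client.files.create(file=open("backfill.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # the 24-hour window that buys the 50% discount
)
print(batch.id, batch.status)  # poll until "completed", then download the output file
```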
Embedding-first retrieval
60–95% cost reduction (vs. long-context loading)
Don't stuff 100k tokens of docs into every prompt. Embed once, retrieve top-K. Embedding cost is $0.02–$0.13 per 1M tokens — orders of magnitude cheaper than re-feeding context every call.
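A minimal embed-once, retrieve-top-K sketch; the three-document corpus is illustrative, and text-embedding-3-small sits at the cheap end of the range quoted above.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Embed the corpus once, offline -- not on every request.
docs = ["refund policy: ...", "shipping times: ...", "api rate limits: ..."]
doc_vecs = embed(docs)
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed([query])[0]
    q /= np.linalg.norm(q)
    top = np.argsort(doc_vecs @ q)[::-1][:k]  # cosine similarity, top-K
    return [docs[i] for i in top]

# Only the top-K chunks enter the prompt, not the whole corpus.
context = "\n\n".join(retrieve("how do refunds work?"))
```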
Picking a default
For greenfield production agents, AISD's default is Claude Sonnet 4 with prompt caching for the planning + tool-calling loop, and Haiku 4 or Gemini 2.5 Flash for classification, routing, and short responses. Reserve Opus 4 or GPT-5 for the hard reasoning steps where the output-quality differential measurably moves your target metric.
Open-weight models earn their place in three scenarios: very high volume (per-call cost dominates), strict latency targets (a dedicated GPU gives you control over first-token latency), and data sovereignty requirements (the model never leaves your perimeter). Outside those, frontier APIs win on quality, ecosystem, and developer velocity.
Want a real cost estimate for your specific use case? Use our interactive ROI calculator →