The calculator separates costs into two fundamentally different categories. This distinction matters because one type is calculated from published provider rates, and the other depends on your own infrastructure choices.
API / Usage costs
Calculated · Verifiable
Charged by AI providers per token, per API call, or per page. Rates come directly from provider pricing pages. You can verify them independently. Includes: LLM inference, search tool calls, embedding generation, OCR.
Operating overhead
Your estimates · Not calculated
Costs you supply based on your own infrastructure. The calculator cannot calculate these for you — they depend on your hosting, database, and vendor choices. Includes: server hosting, vector database, monitoring tools.
Everything in the calculator builds on three foundational ideas. Understanding how they nest inside each other is the key to reading the results accurately.
Token
The unit AI providers use to measure text. Roughly ¾ of a word in English — "ModelPricing" is 3 tokens, a 1,000-word article is roughly 1,300 tokens. Providers charge separately for input tokens (text sent to the model) and output tokens (text the model generates). Output tokens cost significantly more because they require more computation.
Workflow
One complete task a user triggers — e.g. "analyse this contract", "book an appointment", "qualify this lead". A workflow might involve multiple model calls, tool lookups, and document reads. It is the natural unit for measuring usage volume: how many tasks does your system run per month?
Model call
A single API request to the language model. One workflow often requires several calls — for example: extract data → reason about it → generate a reply → check the output. Each call consumes input and output tokens and is billed separately. Multi-step agents typically have 2–5 calls per workflow.
Retry overhead
Not all calls succeed first time. Some fail due to rate limits, JSON parse errors, or model refusals — triggering an automatic retry. Retry overhead % adds a fraction of extra calls on top of your expected call count. A 5% retry rate means 5 extra calls per 100 attempts.
These nest like this: a workflow triggers one or more model calls; each model call consumes input and output tokens; and retry overhead adds a fraction of extra calls on top of the expected count.
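As a rough sketch of how those units combine into a billable call count (function and parameter names here are illustrative, not the calculator's internals):

```typescript
// Total billable model calls per month, combining the units above.
function monthlyModelCalls(
  workflowsPerMonth: number, // e.g. 10,000 tasks per month
  callsPerWorkflow: number,  // multi-step agents: typically 2-5
  retryOverheadPct: number,  // e.g. 5 means 5 extra calls per 100 attempts
): number {
  const baseCalls = workflowsPerMonth * callsPerWorkflow;
  return baseCalls * (1 + retryOverheadPct / 100);
}

// 10,000 workflows x 3 calls x 1.05 retry factor = 31,500 calls/month
console.log(monthlyModelCalls(10_000, 3, 5)); // 31500
```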
This is the core charge — what you pay the AI provider for each model call. Providers charge per million tokens, with separate rates for input and output. The full formula, before cache and batch adjustments:

cost per call = (input tokens ÷ 1,000,000 × input rate) + (output tokens ÷ 1,000,000 × output rate)
Rates are in USD per 1 million tokens. For example, Claude Sonnet 4.6 charges $3 per million input tokens and $15 per million output tokens. A call with 2,000 input tokens and 500 output tokens costs: (2,000/1M × $3) + (500/1M × $15) = $0.006 + $0.0075 = $0.0135.
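The same arithmetic as a small function, reproducing the worked example (names are illustrative):

```typescript
// Cost of a single model call at published per-million-token rates.
function callCost(
  inputTokens: number,
  outputTokens: number,
  inputRatePerM: number,  // USD per 1M input tokens
  outputRatePerM: number, // USD per 1M output tokens
): number {
  return (inputTokens / 1e6) * inputRatePerM +
         (outputTokens / 1e6) * outputRatePerM;
}

// The worked example above: 2,000 in + 500 out at $3 / $15 per million
console.log(callCost(2_000, 500, 3, 15)); // ~0.0135
```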
Two provider features can significantly reduce LLM costs. The calculator applies both as multipliers on top of the base inference cost.
Cache hit rate %
When the same system prompt or context appears in repeated calls, providers can cache the input and charge a fraction of the normal input rate (typically 10–20× cheaper). The cache hit rate is the percentage of your input tokens that get served from cache rather than reprocessed. A customer support agent with a fixed knowledge base might see 30–50% cache hits; a general assistant with unique per-user context might see 5–15%.
Batch share %
Providers offer a batch API for non-real-time work — jobs submitted in bulk and completed within hours rather than seconds. Batch pricing typically gives a 50% discount. The batch share is the percentage of your total calls that can tolerate this delay. Report generation, nightly summarisation, and data enrichment pipelines are good candidates; live chat is not.
Batch discount rate %
The discount applied to batch calls. Most providers offer 50%. This is pre-filled from the model preset but editable if your agreement differs. The discount applies to the entire call cost (input + output), not just one tier.
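Putting the two adjustments together, a plausible sketch looks like this. The 0.1 cached-rate factor is an assumption drawn from the "10–20× cheaper" range above, and the multiplier order is one reasonable reading, not the calculator's published internals:

```typescript
// Apply cache and batch multipliers on top of the base inference cost.
function adjustedInferenceCost(params: {
  inputTokens: number;
  outputTokens: number;
  inputRatePerM: number;
  outputRatePerM: number;
  cacheHitPct: number;       // share of input tokens served from cache
  batchSharePct: number;     // share of calls routed to the batch API
  batchDiscountPct: number;  // typically 50
  cachedRateFactor?: number; // ASSUMPTION: cached input at 10% of full rate
}): number {
  const f = params.cachedRateFactor ?? 0.1;
  const cacheHit = params.cacheHitPct / 100;
  // Input: cached share billed at the reduced rate, the rest at full rate.
  const inputCost =
    (params.inputTokens / 1e6) * params.inputRatePerM *
    ((1 - cacheHit) + cacheHit * f);
  const outputCost = (params.outputTokens / 1e6) * params.outputRatePerM;
  // Batch discount applies to the whole call cost (input + output).
  const batchShare = params.batchSharePct / 100;
  const batchFactor =
    (1 - batchShare) + batchShare * (1 - params.batchDiscountPct / 100);
  return (inputCost + outputCost) * batchFactor;
}
```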
Many AI agents call external search APIs (Brave, Tavily, Bing) to retrieve up-to-date information. This creates two separate charges: one for the search call itself, and one for the retrieved content that gets injected into the model's context.
Web searches per workflow
The average number of search API calls made per workflow. A simple chatbot that sometimes looks things up might average 0.3. A research agent that always searches might average 3–5. Set to 0 if your system does not use web search.
Search content tokens
After a search call returns results, those results are injected into the model's context as additional input tokens. Search content tokens is the average number of tokens retrieved per search call. Search results are typically 1,000–4,000 tokens per call. These are billed at your model's input rate, because they enter the model as input.
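A sketch of the per-workflow search cost under these definitions; the per-search price is hypothetical, so substitute your search provider's actual rate:

```typescript
// Per-workflow web search cost: the search API fee plus the retrieved
// content billed as extra input tokens at the model's input rate.
function searchCostPerWorkflow(
  searchesPerWorkflow: number, // e.g. 0.3 for an occasional lookup
  searchContentTokens: number, // e.g. 1,000-4,000 tokens per call
  inputRatePerM: number,       // your model's input rate, USD per 1M
  pricePerSearch: number,      // HYPOTHETICAL, e.g. $0.005 per call
): number {
  const apiFees = searchesPerWorkflow * pricePerSearch;
  const contextCost =
    searchesPerWorkflow * (searchContentTokens / 1e6) * inputRatePerM;
  return apiFees + contextCost;
}
```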
Embedding
A numerical representation of text — a list of numbers (a "vector") that captures its meaning. Embeddings are used to power semantic search and RAG (retrieval-augmented generation) — finding relevant documents from a knowledge base based on meaning, not keyword matching. Embedding models are separate from language models and typically much cheaper.
Monthly embedding tokens
The total number of tokens you embed each month — both for indexing your knowledge base and for embedding user queries at runtime. A 10,000-document knowledge base with 500 tokens per doc = 5M tokens to index (done once). Embedding each user query might add another 100 tokens × number of queries per month.
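A minimal sketch of the token arithmetic; the query volume is an illustrative assumption, and cost is then tokens ÷ 1M × your embedding model's rate:

```typescript
// Monthly embedding tokens: one-off indexing plus per-query embeddings.
function monthlyEmbeddingTokens(
  docs: number,           // knowledge-base documents to index
  tokensPerDoc: number,   // e.g. 500
  queriesPerMonth: number, // ASSUMPTION: pick your own volume
  tokensPerQuery: number, // e.g. ~100
): number {
  return docs * tokensPerDoc + queriesPerMonth * tokensPerQuery;
}

// 10,000 docs x 500 tokens + 50,000 queries x 100 tokens = 10M tokens
console.log(monthlyEmbeddingTokens(10_000, 500, 50_000, 100)); // 10000000
```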
OCR — Optical Character Recognition
Converting images or scanned PDFs into machine-readable text. AI document pipelines (legal, finance, healthcare) often need to extract text from uploaded PDFs before the language model can read them. OCR APIs (e.g. AWS Textract, Google Document AI) charge per page processed. Set to 0 if your system only handles plain text or already-digital documents.
OCR pages per workflow
The average number of document pages processed in a single workflow. A contract review agent might process 8–15 pages per run. A simple form processor might handle 1–2 pages. The total monthly OCR pages = workflows × pages per workflow.
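The same multiplication as code; the per-page price is hypothetical, so check your OCR provider's published rate:

```typescript
// Monthly OCR cost: total pages processed x per-page rate.
function monthlyOcrCost(
  workflowsPerMonth: number,
  pagesPerWorkflow: number, // e.g. 8-15 for contract review
  pricePerPage: number,     // HYPOTHETICAL, e.g. $0.0015 per page
): number {
  return workflowsPerMonth * pagesPerWorkflow * pricePerPage;
}
```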
These are the infrastructure costs that keep your system running — independent of how many workflows you process. Because they depend entirely on your vendor choices, the calculator takes your estimates directly rather than computing them.
Fixed infrastructure cost
Your monthly server, hosting, and platform costs — regardless of traffic. This includes: compute (EC2, Cloud Run, Fly.io), API gateway, monitoring (Datadog, Sentry), CI/CD pipelines, and any third-party orchestration services (LangSmith, Helicone). Does not scale with usage.
Vector storage cost
If your system uses a vector database for semantic search (Pinecone, Weaviate, Qdrant, pgvector), it has a monthly storage cost. This is typically a flat rate at low scale and grows with the size of your index. Included as a separate field because it is often the second-largest infrastructure line item in RAG systems.
Contingency buffer %
A safety margin applied to the entire cost estimate. Real usage rarely matches projections exactly — token counts vary, traffic spikes, caching behaves differently than expected. A 10–15% contingency is standard for planning purposes. It does not represent a real charge — it is a budget cushion.
The full cost calculation stacks each component in order:

1. LLM inference, adjusted for cache hits and batch share
2. Web search calls plus retrieved content tokens
3. Embedding generation
4. OCR page processing (items 1–4 form the API / usage cost)
5. Fixed infrastructure and vector storage (the operating overhead)
6. Contingency buffer %, applied on top of the combined total
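One plausible way to assemble that stack in code, assuming each component has already been computed per the sections above (field names are illustrative):

```typescript
interface MonthlyCosts {
  llmInference: number;   // after cache + batch adjustments
  webSearch: number;      // API fees + retrieved content tokens
  embeddings: number;
  ocr: number;
  fixedInfra: number;     // hosting, monitoring, orchestration
  vectorStorage: number;
  contingencyPct: number; // e.g. 10-15
}

// Planning total, contingency included. Profit and margin figures use
// the P&L total (apiCost + overhead) without the contingency buffer.
function totalEstimate(c: MonthlyCosts): number {
  const apiCost = c.llmInference + c.webSearch + c.embeddings + c.ocr;
  const overhead = c.fixedInfra + c.vectorStorage;
  return (apiCost + overhead) * (1 + c.contingencyPct / 100);
}
```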
A single number gives false precision. The calculator runs three scenarios using multipliers on your volume inputs — rates stay fixed. These are labelled P10 / P50 / P90 to indicate approximate percentile positions. The distribution is right-skewed: P90 is further from the median than P10, reflecting the real-world tendency for AI costs to blow out more than they collapse. P50 is the median estimate, not a probability-weighted mean.
| Scenario | Workflow volume | Token counts | Search calls | Retry rate | Fixed costs |
|---|---|---|---|---|---|
| P10 (low) | 70% of P50 | 85% of P50 | 70% of P50 | 50% of P50 | 90% of P50 |
| P50 (median) | 100% | 100% | 100% | 100% | 100% |
| P90 (heavy) | 150% of P50 | 120% of P50 | 140% of P50 | 200% of P50 | 110% of P50 |
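Expressed as code, the table reduces to a small set of volume multipliers applied to your P50 inputs (a sketch; the structure is illustrative):

```typescript
// Scenario multipliers from the table above. They scale volume inputs
// only; published provider rates stay fixed across scenarios.
const SCENARIOS = {
  p10: { workflows: 0.70, tokens: 0.85, searches: 0.70, retry: 0.50, fixed: 0.90 },
  p50: { workflows: 1.00, tokens: 1.00, searches: 1.00, retry: 1.00, fixed: 1.00 },
  p90: { workflows: 1.50, tokens: 1.20, searches: 1.40, retry: 2.00, fixed: 1.10 },
} as const;
```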
The bar chart shows all three side by side. For budget approval, use the P90 figure — it covers a high-usage month. For unit economics and pricing decisions, use P50.
API cost / month
What you would owe AI providers at P50 volume. Calculated from published rates × your volume estimates. The rates are sourced from provider pricing pages; the volumes (tokens per call, retry %, cache-hit %) are your estimates and carry the same uncertainty as operating overhead inputs.
API cost per workflow
API cost divided by total monthly workflows. The most useful unit for product decisions — tells you what one task costs in AI charges before overhead. Use this to compare models or price individual features.
Operating overhead / month
Your fixed infrastructure costs (infra + vector storage). Based on your own estimates — the calculator does not compute these. Does not include contingency, which is shown separately in the budget view.
Total estimate / month
Planning total: API cost + operating overhead + contingency buffer. Used for budget approval. Profit, gross margin, and break-even use the P&L total (without contingency) — because contingency is a planning reserve, not a realized expense.
Cost per active user
(API cost + operating overhead) ÷ monthly active users. Excludes contingency. This is your unit economics baseline — the minimum revenue per user needed for the product to cover its direct costs.
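The division as a small function, excluding contingency as described (names illustrative):

```typescript
// Unit-economics baseline: direct cost per monthly active user.
function costPerActiveUser(
  apiCostPerMonth: number,
  operatingOverheadPerMonth: number,
  monthlyActiveUsers: number,
): number {
  return (apiCostPerMonth + operatingOverheadPerMonth) / monthlyActiveUsers;
}
```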
Price for target gross margin
Uses only variable API costs per user (the correct COGS for gross margin). Formula: variable_cost_per_user / (1 − target_GM). This matches how gross margin is defined in a standard P&L — revenue minus direct cost of revenue, divided by revenue. It excludes fixed infra, monitoring, and contingency, which belong below the gross margin line. For a full operating-margin price, add fixed cost allocation on top.
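A sketch of that formula with a worked example (numbers illustrative):

```typescript
// Price needed to hit a target gross margin, using variable API cost
// per user as COGS, per the definition above.
function priceForTargetMargin(
  variableCostPerUser: number,  // API cost only, excludes fixed infra
  targetGrossMarginPct: number, // e.g. 70
): number {
  return variableCostPerUser / (1 - targetGrossMarginPct / 100);
}

// $2.40 of API cost per user at a 70% target margin -> $8.00 price
console.log(priceForTargetMargin(2.4, 70)); // ~8
```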
Break-even users (CVP)
Uses the standard cost-volume-profit formula: fixed costs ÷ (revenue per user − variable cost per user). This is the volume at which contribution margin covers fixed costs. If variable cost per user exceeds revenue per user, there is no break-even — the calculator surfaces this as a warning. Note: excludes contingency from both fixed and variable costs.
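A sketch of the CVP calculation, including the no-break-even case mentioned above (names illustrative):

```typescript
// Standard CVP break-even: fixed costs / contribution margin per user.
// Returns null when contribution margin is zero or negative, meaning
// there is no break-even volume (the calculator surfaces a warning).
function breakEvenUsers(
  fixedCostsPerMonth: number,  // infra + vector storage, no contingency
  revenuePerUser: number,
  variableCostPerUser: number, // API cost per user, no contingency
): number | null {
  const contribution = revenuePerUser - variableCostPerUser;
  if (contribution <= 0) return null;
  return Math.ceil(fixedCostsPerMonth / contribution);
}
```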
Gross margin %
Only shown when you enter a revenue per user. (revenue − API costs) / revenue × 100. This is true gross margin — revenue minus variable COGS only. Fixed infra and contingency sit below the gross margin line. A SaaS business typically targets 60–80% gross margin on this basis.
Yearly run-rate
Planning total × 12. This is a flat run-rate, not a projection. It does not account for provider price changes (historically 20–40% annual decline), user growth, or discounting. Use it as a rough annual budget ceiling, not a financial forecast.
All prices are sourced directly from official provider documentation and pricing pages. We do not use third-party aggregators as a primary source.
| Provider | Source |
|---|---|
| OpenAI | openai.com/api/pricing |
| Anthropic | anthropic.com/pricing |
| Google Vertex AI | cloud.google.com/vertex-ai/pricing |
| AWS Bedrock | aws.amazon.com/bedrock/pricing |
To allow fair comparison across providers, all prices are standardised to USD per 1 million tokens, with input and output rates listed separately.
Prices are reviewed manually against official documentation. We aim to reflect provider price changes within 7 days of announcement. Major changes (new model releases, significant price drops) are typically reflected within 48 hours.
The calculator shows a "Prices last updated" timestamp in the pricing overrides section. The models comparison page subtitle indicates the month of the most recent full review.
ModelPricing is an independent resource. Editorial decisions — which models to include, how to describe them — are not influenced by providers or affiliate relationships. Affiliate partnerships cover tool recommendations only, not model pricing data.
Questions or corrections? contact@elythra.studio — or open the calculator and start with one of the presets to see the numbers in action.