What Is a Token?
A token is a chunk of text that an AI model processes. It's not a character, and it's not a full word. It's somewhere in between.
Think of it like this: the AI doesn't read words the way you do. It breaks everything into small pieces called tokens, then processes those pieces.
How big is a token?
- 1 token is roughly 3-4 characters of English text
- 1 word is usually 1-2 tokens
- 100 tokens is roughly 75 words
- 1,000 tokens is roughly 750 words (about 1.5 pages)
Simple, common words use fewer tokens. Technical jargon, code, and non-English text use more. The word "API" is 1 token, but "implementation" might be 2-3 tokens.
Tokens = Money
Every time you send a message to an AI model through an API, you're paying for tokens. Every. Single. Time.
AI providers charge per token, usually priced per 1 million tokens (MTok). The formula is dead simple: cost = (tokens ÷ 1,000,000) × price per MTok. At mid-tier prices, a single question-and-answer exchange works out to around two cents.
That seems tiny. But multiply it:
| Scenario | Exchanges/Day | Daily Cost | Monthly Cost |
|---|---|---|---|
| Light personal use | 50 | $1.05 | $31.50 |
| Dev team (5 people) | 500 | $10.50 | $315 |
| Customer-facing chatbot | 5,000 | $105 | $3,150 |
| Heavy production app | 50,000 | $1,050 | $31,500 |
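The arithmetic behind the table is worth seeing once. A minimal sketch, assuming roughly 2,000 input and 1,000 output tokens per exchange at Claude Sonnet list prices; the per-exchange token counts are illustrative assumptions, not measurements:

```python
# Rough per-exchange cost, assuming ~2,000 input and ~1,000 output
# tokens per exchange (illustrative numbers, not measurements).
INPUT_PRICE_PER_MTOK = 3.00    # Claude Sonnet input, $/million tokens
OUTPUT_PRICE_PER_MTOK = 15.00  # Claude Sonnet output, $/million tokens

def exchange_cost(input_tokens: int, output_tokens: int) -> float:
    """cost = (tokens / 1,000,000) * price per MTok, summed for input and output."""
    return (input_tokens / 1_000_000 * INPUT_PRICE_PER_MTOK
            + output_tokens / 1_000_000 * OUTPUT_PRICE_PER_MTOK)

per_exchange = exchange_cost(2_000, 1_000)  # $0.021 per exchange
daily = per_exchange * 50                   # light personal use: $1.05/day
monthly = daily * 30                        # $31.50/month
print(f"${per_exchange:.3f}/exchange, ${daily:.2f}/day, ${monthly:.2f}/month")
```

Swap in your own token counts and the prices from the table below to estimate any workload.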
A single poorly designed AI chatbot can burn through $1,000+ per month without you realizing it. Most people don't check their API usage until they get the bill.
Input vs Output Tokens
This is a detail most people miss: input tokens and output tokens are priced differently.
- Input tokens = what YOU send to the AI (your question, context, system prompt, conversation history)
- Output tokens = what the AI sends BACK to you (the response)
Output tokens typically cost 4-5x more than input tokens.
| Model | Input Price (per MTok) | Output Price (per MTok) | Output Multiplier |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | 4x |
| Claude Sonnet 4.5 | $3.00 | $15.00 | 5x |
| Claude Opus 4 | $15.00 | $75.00 | 5x |
| Claude Haiku 3.5 | $0.80 | $4.00 | 5x |
| GPT-4o mini | $0.15 | $0.60 | 4x |
If your AI is writing long, verbose responses and you only need a short answer — you're overpaying on output tokens. Tell the model to be concise. A simple instruction like "Answer in 2-3 sentences" can cut your output costs by 80%.
The Context Window
The context window is the total amount of text (in tokens) that a model can "see" at one time. Think of it as the AI's working memory.
| Model | Context Window | Roughly Equivalent To |
|---|---|---|
| GPT-4o | 128K tokens | ~200 pages / a short novel |
| Claude Sonnet 4.5 | 200K tokens | ~300 pages / a full novel |
| Gemini 1.5 Pro | 2M tokens | ~3,000 pages / several textbooks |
What happens when you hit the limit?
In a chat interface, nothing visibly crashes: the app silently drops the oldest parts of the conversation to make room for new ones, which is why long chats seem to forget how they started. This is called "falling out of context." A raw API request that exceeds the limit, by contrast, simply returns an error.
A 200K context window doesn't mean you should USE all 200K tokens. A bigger context means a bigger bill. If you stuff 100K tokens of context into every request, you're paying for 100K input tokens every single time you send a message. At Sonnet's $3/MTok, that's $0.30 of input cost per message before the model writes a single word.
The trap: Just because a model CAN handle 200K tokens doesn't mean it SHOULD. Performance degrades on very long contexts. The model may "lose focus" on important details buried in the middle of a massive context.
Long Conversations: The Hidden Cost
This is the single biggest gotcha for most people. Here's what actually happens in a conversation with an AI:
Every message re-sends the ENTIRE conversation history. The AI doesn't "remember" — it re-reads everything from scratch each time.
The snowball effect
A 20-message conversation doesn't cost 20x a single message. It costs closer to 200x, because each new message re-sends every previous message as input, so input costs grow quadratically with conversation length.
A single long conversation with a powerful model can easily cost $1-5+ in tokens. If you have users running long conversations with your AI product, this adds up to thousands per month — fast.
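The snowball is easy to simulate. A sketch assuming every message is about 100 tokens and the full history is re-sent as input on each turn (round numbers chosen purely for illustration):

```python
def total_input_tokens(num_messages: int, tokens_per_message: int = 100) -> int:
    """Total input tokens across a conversation where each turn
    re-sends the entire history: 1 + 2 + ... + n messages' worth."""
    return sum(turn * tokens_per_message for turn in range(1, num_messages + 1))

single = total_input_tokens(1)   # 100 tokens: one message, sent once
twenty = total_input_tokens(20)  # 21,000 tokens: history resent every turn
print(f"A 20-message conversation uses {twenty // single}x the input of one message")
```

The 210x multiplier is where the "closer to 200x" figure comes from: the sum 1 + 2 + ... + 20 is 210.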
What to do about it
- Start new conversations for new topics instead of continuing old ones
- Summarize and reset — periodically summarize the conversation and start fresh with the summary
- Set a conversation length limit in your apps (e.g., max 20 messages, then suggest starting a new chat)
- Only include relevant history — don't send the full conversation if the user's question doesn't need it
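The "only include relevant history" advice can be as simple as a sliding window. A minimal sketch; the message format here is a generic list of dicts, not any particular provider's schema:

```python
def trim_history(messages: list[dict], max_messages: int = 20) -> list[dict]:
    """Keep the system prompt (if present) plus only the most recent messages."""
    if not messages:
        return []
    # Preserve a leading system prompt so instructions never fall out of context.
    head = [messages[0]] if messages[0].get("role") == "system" else []
    body = messages[len(head):]
    return head + body[-max_messages:]

history = [{"role": "system", "content": "Be concise."}]
history += [{"role": "user", "content": f"msg {i}"} for i in range(50)]
trimmed = trim_history(history, max_messages=20)
print(len(trimmed))  # 21: the system prompt plus the last 20 messages
```

A production version would summarize the dropped messages instead of discarding them outright, per the "summarize and reset" tip above.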
Model Selection & Pricing
Not every task needs the most powerful (and expensive) model. Choosing the right model for the job is the easiest way to cut costs.
The model hierarchy
Every major provider offers roughly three tiers:
- Small and cheap: Claude Haiku, GPT-4o mini, Gemini Flash
- Mid-tier workhorse: Claude Sonnet, GPT-4o
- Premium reasoning: Claude Opus, OpenAI o1
When to use what
| Task | Best Model Tier | Why |
|---|---|---|
| Classify text, extract data, simple Q&A | Haiku / Mini | Fast, cheap, good enough |
| Write content, code, analysis | Sonnet / GPT-4o | Great quality, reasonable cost |
| Complex reasoning, architecture, research | Opus / o1 | Best quality, premium cost |
| Summarize text, format data | Haiku / Mini | Don't overpay for simple tasks |
Route by task complexity. In production apps, use a small model to classify the request first, then route complex requests to a powerful model and simple ones to a cheap model. This alone can cut costs 50-70%.
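A sketch of complexity-based routing. The classifier here is a trivial keyword heuristic standing in for what would, in production, be a call to a small, cheap model; the model names are placeholders, not real model IDs:

```python
CHEAP_MODEL = "claude-haiku"      # placeholder name, not a real model ID
POWERFUL_MODEL = "claude-sonnet"  # placeholder name, not a real model ID

# Crude stand-in for a small-model classifier call.
COMPLEX_HINTS = ("architecture", "debug", "prove", "design", "analyze")

def pick_model(request: str) -> str:
    """Route simple requests to a cheap model, complex ones to a powerful one."""
    if any(hint in request.lower() for hint in COMPLEX_HINTS):
        return POWERFUL_MODEL
    return CHEAP_MODEL

print(pick_model("What's your refund policy?"))         # routes to the cheap model
print(pick_model("Help me design a sharded database"))  # routes to the powerful model
```

Even a router this naive captures the core idea: the expensive model only sees the requests that need it.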
System Prompts: The Silent Token Eaters
A system prompt is the hidden instruction you give the AI before the user ever types anything. Things like "You are a helpful customer service agent for Acme Corp..."
Here's the problem: the system prompt is sent with EVERY single message.
How to fix it
- Keep system prompts short. Every word counts — literally. Cut the fluff.
- Use caching (see next section) to avoid re-processing the same system prompt
- Don't put examples in the system prompt unless absolutely necessary — they're expensive to repeat
- Move static instructions to cached context instead of the system prompt
The classic mistake: pasting your entire company FAQ, product docs, or a lengthy persona description into the system prompt. A 5,000-token system prompt costs you $0.015 per message (at Sonnet input prices) for the prompt alone, before the user even says anything.
Caching: Stop Paying Twice
Prompt caching is one of the most powerful cost-saving features available. The concept is simple: if you're sending the same content repeatedly, cache it so you only pay full price once.
How prompt caching works
The first request that includes a long, stable prefix (such as your system prompt) writes a processed copy of it to a cache. Subsequent requests that begin with the exact same prefix reuse that copy at a steep discount instead of paying to reprocess it from scratch. Change even one character of the prefix, though, and the cache misses.
Caching savings by provider
| Provider | Cache Write Cost | Cache Read Cost | Savings |
|---|---|---|---|
| Anthropic (Claude) | 1.25x base price (once) | 0.1x base price | 90% on reads |
| OpenAI (GPT) | Free (automatic) | 0.5x base price | 50% on reads |
| Google (Gemini) | Free | 0.25x base price | 75% on reads |
What should you cache?
- System prompts — sent with every request, perfect for caching
- Long documents — if users are asking questions about the same document
- Few-shot examples — the examples you include to show the AI what you want
- Conversation history prefix — the earlier parts of a conversation that don't change
A customer support bot with a 3,000-token system prompt handling 1,000 messages/day:
Without caching: $9.00/day in system prompt costs alone
With caching: $0.90/day — saving $243/month from one simple change.
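The support-bot numbers check out with a few lines of arithmetic. This sketch ignores the one-time 1.25x cache-write premium, which is negligible at this volume:

```python
SYSTEM_PROMPT_TOKENS = 3_000
MESSAGES_PER_DAY = 1_000
INPUT_PRICE = 3.00           # $/MTok, Claude Sonnet input
CACHE_READ_MULTIPLIER = 0.1  # cached reads cost 10% of the base input price

tokens_per_day = SYSTEM_PROMPT_TOKENS * MESSAGES_PER_DAY
without_caching = tokens_per_day / 1_000_000 * INPUT_PRICE  # $9.00/day
with_caching = without_caching * CACHE_READ_MULTIPLIER      # $0.90/day
monthly_savings = (without_caching - with_caching) * 30     # $243.00/month
print(f"${without_caching:.2f}/day -> ${with_caching:.2f}/day, "
      f"saving ${monthly_savings:.2f}/month")
```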
Setting Limits & Budgets
Every AI provider gives you tools to set spending limits. Use them. An uncapped API key is a ticking time bomb.
Limits you should set immediately
| Limit Type | What It Does | Where to Set It |
|---|---|---|
| Monthly budget cap | Hard stop when you hit $X/month | Provider dashboard |
| Rate limit | Max requests per minute/hour | Provider dashboard or your code |
| Max tokens per request | Limit how long the AI's response can be | API parameter: max_tokens |
| Max conversation length | Cap how many messages before reset | Your application code |
| Per-user daily limit | Prevent one user from burning your budget | Your application code |
| Alert thresholds | Email you when spending hits 50%, 80% | Provider dashboard |
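The per-user daily limit from the table lives in your application code. A minimal in-memory sketch; a real service would persist the counts in Redis or a database and reset them daily:

```python
from collections import defaultdict

class DailyTokenBudget:
    """Track per-user token usage and refuse requests over a daily cap.
    In-memory only; a real deployment would persist and reset this daily."""

    def __init__(self, daily_cap: int = 100_000):
        self.daily_cap = daily_cap
        self.used = defaultdict(int)

    def try_spend(self, user_id: str, tokens: int) -> bool:
        if self.used[user_id] + tokens > self.daily_cap:
            return False  # over budget: reject before calling the API
        self.used[user_id] += tokens
        return True

budget = DailyTokenBudget(daily_cap=10_000)
print(budget.try_spend("alice", 8_000))  # True: within budget
print(budget.try_spend("alice", 5_000))  # False: would exceed the 10,000 cap
print(budget.try_spend("bob", 5_000))    # True: separate user, separate budget
```

The key design choice is rejecting *before* the API call is made: once the request goes out, the tokens are already billed.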
A developer left an API key in a public repo. A bot found it and ran thousands of requests. The bill: $14,000 in one weekend. Always set budget caps, always rotate exposed keys, and always monitor usage.
Images, Files & Hidden Token Costs
Text isn't the only thing that costs tokens. Many people are shocked by how many tokens images and files consume.
Image token costs
| Image Size | Approximate Tokens | Cost (Sonnet) |
|---|---|---|
| Small thumbnail (100x100) | ~200 tokens | $0.0006 |
| Medium image (500x500) | ~1,000 tokens | $0.003 |
| Large image (1000x1000) | ~1,600 tokens | $0.005 |
| High-res photo (2000x2000) | ~3,200+ tokens | $0.01+ |
| Screenshot (1920x1080) | ~2,500 tokens | $0.008 |
Other hidden token costs
- PDF files: Converted to text, can be thousands of tokens per page
- Code files: Code is token-heavy because of syntax, indentation, and special characters
- JSON/XML: All those brackets, keys, and formatting? Tokens. A 1KB JSON blob can be 300+ tokens
- Tool/function definitions: If you're using tool use or function calling, those schemas are sent as tokens every time
Sending 5 screenshots in one message could cost 12,000+ input tokens, which is more text than a 15-page document. Resize images before sending them to the AI, or describe what's in the image instead.
Streaming vs Batch
Streaming
Streaming shows the AI's response word-by-word as it's generated (like how ChatGPT types in real time). Streaming costs the same in tokens — it doesn't save money. But it feels faster to users because they see output immediately.
Batch processing
If you have hundreds or thousands of requests that don't need instant answers, batch APIs offer 50% discounts.
| Use Case | Best Approach | Why |
|---|---|---|
| Live chatbot | Streaming | Users need instant responses |
| Processing 1,000 documents | Batch | 50% cost savings, no rush |
| Nightly report generation | Batch | Save money, run overnight |
| Interactive code assistant | Streaming | Developers want real-time output |
Anthropic's Batch API gives you 50% off and processes within 24 hours. If your workload can wait, this is free money.
The Gotchas Nobody Tells You
These are the things that catch people off guard. Bookmark this section.
1. Retries multiply your cost
If your code automatically retries failed requests, you're paying for every attempt. Three retries = 3x the cost for one answer. Always implement exponential backoff and set a retry limit.
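A sketch of exponential backoff with a hard retry cap, so a flaky endpoint can cost you at most `max_retries + 1` billed attempts. The `call` argument stands in for your actual API request:

```python
import time

def call_with_backoff(call, max_retries: int = 3, base_delay: float = 1.0):
    """Retry a failing call at most max_retries times, doubling the wait
    each attempt. Every retry is a fully billed request, so the cap matters."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the error instead of spending more
            time.sleep(base_delay * (2 ** attempt))

# Simulated flaky endpoint: fails twice, then succeeds.
attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(call_with_backoff(flaky, base_delay=0.01))  # "ok", after 2 billed failures
print(len(attempts))  # 3 attempts total, each one billed
```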
2. "Temperature" doesn't affect cost, but it affects waste
Higher temperature = more creative but sometimes nonsensical responses. If the AI gives a bad answer and the user has to ask again, you just paid double.
3. Empty or error responses still cost tokens
If the AI returns an error or a useless response, you still paid for the input tokens. Validate inputs before sending them.
4. Thinking tokens (extended thinking / chain-of-thought)
Some models now support "thinking" or "reasoning" modes where the AI works through a problem step by step. Those thinking tokens count as output tokens — the most expensive kind. A model "thinking" for 5,000 tokens before giving a 200-token answer means you're paying for 5,200 output tokens.
5. Conversation forking multiplies costs
If a user edits an earlier message (like in ChatGPT or Claude), the AI re-processes everything from that point forward. That's a whole new conversation branch, paid in full.
6. Tool use / function calling adds tokens
Every tool definition you give the AI is sent as tokens. 10 tools with complex schemas can add 2,000-5,000 tokens to every request — before the user even says anything.
7. The "helpful" AI problem
AI models love to be thorough. Ask a yes/no question, get a 500-word essay. That's 500 output tokens you didn't need. Be specific in your prompts: "Answer with only yes or no."
You can't un-send tokens. Once the API call is made, you're charged — even if you cancel the stream mid-response, even if the answer is wrong, even if your app crashes before showing it to the user. Design defensively.
Token Counting Tools
You don't have to guess how many tokens something is. Use these tools:
| Tool | Works With | Type |
|---|---|---|
| Anthropic Token Counter (API) | Claude models | API endpoint |
| OpenAI Tokenizer (tiktoken) | GPT models | Python library / web tool |
| Anthropic Console | Claude | Usage dashboard |
| OpenAI Usage Dashboard | GPT models | Web dashboard |
| LLM Price Check (llm-price.com) | All models | Price comparison website |
Check your provider's usage dashboard weekly. Set up email alerts at 50% and 80% of your budget. Surprises are expensive in the token world.
Cost Estimation Cheat Sheet
Quick reference for estimating costs before you build:
| Content Type | Approx. Tokens | Real-World Example |
|---|---|---|
| A tweet (280 chars) | ~50 tokens | Quick classification or sentiment |
| A paragraph | ~100-150 tokens | Summary request |
| An email | ~200-500 tokens | Draft or reply generation |
| A full page of text | ~500-700 tokens | Document analysis |
| A blog post | ~1,000-3,000 tokens | Content generation |
| A code file (200 lines) | ~1,500-2,500 tokens | Code review or debugging |
| A 10-page PDF | ~5,000-7,000 tokens | Document Q&A |
| A book chapter | ~10,000-15,000 tokens | Long-form analysis |
Rule of thumb: Take your word count, multiply by 1.3, and you have a rough token estimate. For code, multiply by 1.5-2x because of syntax characters.
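The rule of thumb as code, using the document's multipliers (1.3 for prose, 1.75 as the midpoint of the 1.5-2x code range). Both are rough heuristics, not tokenizer output:

```python
def estimate_tokens(text: str, is_code: bool = False) -> int:
    """Rough token estimate: word count x 1.3 for prose, x 1.75 for code.
    For exact counts, use your provider's tokenizer or token-counting API."""
    words = len(text.split())
    multiplier = 1.75 if is_code else 1.3
    return round(words * multiplier)

print(estimate_tokens("The quick brown fox jumps over the lazy dog"))  # 12
```

Good enough for budgeting; use the tools in the previous section when you need exact numbers.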
Provider Comparison
Each provider does things slightly differently. Here's what matters:
| Feature | Anthropic (Claude) | OpenAI (GPT) | Google (Gemini) |
|---|---|---|---|
| Top model context | 200K tokens | 128K tokens | 2M tokens |
| Prompt caching | 90% savings | 50% savings | 75% savings |
| Batch discount | 50% off | 50% off | Varies |
| Budget caps | Yes (dashboard) | Yes (dashboard) | Yes (dashboard) |
| Free tier | Limited | Limited | Generous |
| Cheapest model | Haiku ($0.80/MTok in) | GPT-4o mini ($0.15/MTok in) | Flash ($0.075/MTok in) |
No single provider is cheapest for everything. Google Gemini has the largest context window and cheapest small models. Anthropic has the best caching savings. OpenAI has the broadest ecosystem. Pick based on your specific use case.
Real-World Scenarios
Scenario 1: "I just want to build a chatbot for my small business"
You build a customer support chatbot. 50 customers/day, average 8 messages each.
Scenario 2: "I'm using AI to process my company's documents"
You upload 500 documents (average 5 pages each) for analysis.
Scenario 3: "My dev team uses AI coding assistants all day"
5 developers, each making ~100 AI requests per day with code context.
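A sketch that puts rough numbers on scenarios like these. Every per-request token count below is an assumption chosen for illustration; plug in your own measurements and the prices from the tables above:

```python
def monthly_cost(requests_per_day: int, input_tokens: int, output_tokens: int,
                 in_price: float = 3.00, out_price: float = 15.00,
                 days: int = 30) -> float:
    """Estimated monthly cost in dollars, defaulting to Claude Sonnet prices."""
    per_request = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    return per_request * requests_per_day * days

# Assumed workloads (illustrative token counts, not measurements):
chatbot = monthly_cost(50 * 8, input_tokens=1_500, output_tokens=300)  # support bot
docs = monthly_cost(500 // 30, input_tokens=4_000, output_tokens=500)  # doc pipeline
devs = monthly_cost(5 * 100, input_tokens=3_000, output_tokens=800)    # coding assist
for name, cost in [("chatbot", chatbot), ("documents", docs), ("dev team", devs)]:
    print(f"{name}: ~${cost:,.0f}/month")
```

Under these assumptions the chatbot lands around $108/month and the dev team around $315/month, which is exactly the range where routing simple requests to a cheaper model starts paying for itself.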