What Is a Token?
A token is a chunk of text that an AI model processes. It's not a character, and it's not a full word. It's somewhere in between.
Think of it like this: the AI doesn't read words the way you do. It breaks everything into small pieces called tokens, then processes those pieces.
How big is a token?
- 1 token is roughly 3-4 characters of English text
- 1 word is usually 1-2 tokens
- 100 tokens is roughly 75 words
- 1,000 tokens is roughly 750 words (about 1.5 pages)
Simple, common words use fewer tokens. Technical jargon, code, and non-English text use more. The word "API" is 1 token, but "implementation" might be 2-3 tokens.
Tokens = Money
Every time you send a message to an AI model through an API, you're paying for tokens. Every. Single. Time.
AI providers charge per token, usually priced per 1 million tokens (MTok). The formula is dead simple: cost = (tokens ÷ 1,000,000) × price per MTok. At mid-tier prices, a single question-and-answer exchange works out to around two cents.
That seems tiny. But multiply it:
| Scenario | Exchanges/Day | Daily Cost | Monthly Cost |
|---|---|---|---|
| Light personal use | 50 | $1.05 | $31.50 |
| Dev team (5 people) | 500 | $10.50 | $315 |
| Customer-facing chatbot | 5,000 | $105 | $3,150 |
| Heavy production app | 50,000 | $1,050 | $31,500 |
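The arithmetic behind the table is worth seeing once. A minimal sketch, assuming roughly 2,000 input and 1,000 output tokens per exchange at Claude Sonnet list prices; the per-exchange token counts are illustrative assumptions, not measurements:

```python
# Rough per-exchange cost, assuming ~2,000 input and ~1,000 output
# tokens per exchange (illustrative numbers, not measurements).
INPUT_PRICE_PER_MTOK = 3.00    # Claude Sonnet input, $/million tokens
OUTPUT_PRICE_PER_MTOK = 15.00  # Claude Sonnet output, $/million tokens

def exchange_cost(input_tokens: int, output_tokens: int) -> float:
    """cost = (tokens / 1,000,000) * price per MTok, summed for input and output."""
    return (input_tokens / 1_000_000 * INPUT_PRICE_PER_MTOK
            + output_tokens / 1_000_000 * OUTPUT_PRICE_PER_MTOK)

per_exchange = exchange_cost(2_000, 1_000)  # $0.021 per exchange
daily = per_exchange * 50                   # light personal use: $1.05/day
monthly = daily * 30                        # $31.50/month
print(f"${per_exchange:.3f}/exchange, ${daily:.2f}/day, ${monthly:.2f}/month")
```

Swap in your own token counts and the prices from the table below to estimate any workload.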
A single poorly designed AI chatbot can burn through $1,000+ per month without you realizing it. Most people don't check their API usage until they get the bill.
Input vs Output Tokens
This is a detail most people miss: input tokens and output tokens are priced differently.
- Input tokens = what YOU send to the AI (your question, context, system prompt, conversation history)
- Output tokens = what the AI sends BACK to you (the response)
Output tokens typically cost 4-5x more than input tokens.
| Model | Input Price (per MTok) | Output Price (per MTok) | Output Multiplier |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | 4x |
| Claude Sonnet 4.5 | $3.00 | $15.00 | 5x |
| Claude Opus 4 | $15.00 | $75.00 | 5x |
| Claude Haiku 3.5 | $0.80 | $4.00 | 5x |
| GPT-4o mini | $0.15 | $0.60 | 4x |
If your AI is writing long, verbose responses and you only need a short answer — you're overpaying on output tokens. Tell the model to be concise. A simple instruction like "Answer in 2-3 sentences" can cut your output costs by 80%.
The Context Window
The context window is the total amount of text (in tokens) that a model can "see" at one time. Think of it as the AI's working memory.
| Model | Context Window | Roughly Equivalent To |
|---|---|---|
| GPT-4o | 128K tokens | ~200 pages / a short novel |
| Claude Sonnet 4.5 | 200K tokens | ~300 pages / a full novel |
| Gemini 1.5 Pro | 2M tokens | ~3,000 pages / several textbooks |
What happens when you hit the limit?
In a chat interface, nothing visibly crashes: the app silently drops the oldest parts of the conversation to make room for new ones, which is why long chats seem to forget how they started. This is called "falling out of context." A raw API request that exceeds the limit, by contrast, simply returns an error.
A 200K context window doesn't mean you should USE all 200K tokens. A bigger context means a bigger bill. If you stuff 100K tokens of context into every request, you're paying for 100K input tokens every single time you send a message. At Sonnet's $3/MTok, that's $0.30 of input cost per message before the model writes a single word.
The trap: Just because a model CAN handle 200K tokens doesn't mean it SHOULD. Performance degrades on very long contexts. The model may "lose focus" on important details buried in the middle of a massive context.
Long Conversations: The Hidden Cost
This is the single biggest gotcha for most people. Here's what actually happens in a conversation with an AI:
Every message re-sends the ENTIRE conversation history. The AI doesn't "remember" — it re-reads everything from scratch each time.
The snowball effect
A 20-message conversation doesn't cost 20x a single message. It costs closer to 200x, because each new message re-sends every previous message as input, so input costs grow quadratically with conversation length.
A single long conversation with a powerful model can easily cost $1-5+ in tokens. If you have users running long conversations with your AI product, this adds up to thousands per month — fast.
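The snowball is easy to simulate. A sketch assuming every message is about 100 tokens and the full history is re-sent as input on each turn (round numbers chosen purely for illustration):

```python
def total_input_tokens(num_messages: int, tokens_per_message: int = 100) -> int:
    """Total input tokens across a conversation where each turn
    re-sends the entire history: 1 + 2 + ... + n messages' worth."""
    return sum(turn * tokens_per_message for turn in range(1, num_messages + 1))

single = total_input_tokens(1)   # 100 tokens: one message, sent once
twenty = total_input_tokens(20)  # 21,000 tokens: history resent every turn
print(f"A 20-message conversation uses {twenty // single}x the input of one message")
```

The 210x multiplier is where the "closer to 200x" figure comes from: the sum 1 + 2 + ... + 20 is 210.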
What to do about it
- Start new conversations for new topics instead of continuing old ones
- Summarize and reset — periodically summarize the conversation and start fresh with the summary
- Set a conversation length limit in your apps (e.g., max 20 messages, then suggest starting a new chat)
- Only include relevant history — don't send the full conversation if the user's question doesn't need it
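The "only include relevant history" advice can be as simple as a sliding window. A minimal sketch; the message format here is a generic list of dicts, not any particular provider's schema:

```python
def trim_history(messages: list[dict], max_messages: int = 20) -> list[dict]:
    """Keep the system prompt (if present) plus only the most recent messages."""
    if not messages:
        return []
    # Preserve a leading system prompt so instructions never fall out of context.
    head = [messages[0]] if messages[0].get("role") == "system" else []
    body = messages[len(head):]
    return head + body[-max_messages:]

history = [{"role": "system", "content": "Be concise."}]
history += [{"role": "user", "content": f"msg {i}"} for i in range(50)]
trimmed = trim_history(history, max_messages=20)
print(len(trimmed))  # 21: the system prompt plus the last 20 messages
```

A production version would summarize the dropped messages instead of discarding them outright, per the "summarize and reset" tip above.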
Model Selection & Pricing
Not every task needs the most powerful (and expensive) model. Choosing the right model for the job is the easiest way to cut costs.
The model hierarchy
Every major provider offers roughly three tiers:
- Small and cheap: Claude Haiku, GPT-4o mini, Gemini Flash
- Mid-tier workhorse: Claude Sonnet, GPT-4o
- Premium reasoning: Claude Opus, OpenAI o1
When to use what
| Task | Best Model Tier | Why |
|---|---|---|
| Classify text, extract data, simple Q&A | Haiku / Mini | Fast, cheap, good enough |
| Write content, code, analysis | Sonnet / GPT-4o | Great quality, reasonable cost |
| Complex reasoning, architecture, research | Opus / o1 | Best quality, premium cost |
| Summarize text, format data | Haiku / Mini | Don't overpay for simple tasks |
Route by task complexity. In production apps, use a small model to classify the request first, then route complex requests to a powerful model and simple ones to a cheap model. This alone can cut costs 50-70%.
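A sketch of complexity-based routing. The classifier here is a trivial keyword heuristic standing in for what would, in production, be a call to a small, cheap model; the model names are placeholders, not real model IDs:

```python
CHEAP_MODEL = "claude-haiku"      # placeholder name, not a real model ID
POWERFUL_MODEL = "claude-sonnet"  # placeholder name, not a real model ID

# Crude stand-in for a small-model classifier call.
COMPLEX_HINTS = ("architecture", "debug", "prove", "design", "analyze")

def pick_model(request: str) -> str:
    """Route simple requests to a cheap model, complex ones to a powerful one."""
    if any(hint in request.lower() for hint in COMPLEX_HINTS):
        return POWERFUL_MODEL
    return CHEAP_MODEL

print(pick_model("What's your refund policy?"))         # routes to the cheap model
print(pick_model("Help me design a sharded database"))  # routes to the powerful model
```

Even a router this naive captures the core idea: the expensive model only sees the requests that need it.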
System Prompts: The Silent Token Eaters
A system prompt is the hidden instruction you give the AI before the user ever types anything. Things like "You are a helpful customer service agent for Acme Corp..."
Here's the problem: the system prompt is sent with EVERY single message.
How to fix it
- Keep system prompts short. Every word counts — literally. Cut the fluff.
- Use caching (see next section) to avoid re-processing the same system prompt
- Don't put examples in the system prompt unless absolutely necessary — they're expensive to repeat
- Move static instructions to cached context instead of the system prompt
The classic mistake: pasting your entire company FAQ, product docs, or a lengthy persona description into the system prompt. A 5,000-token system prompt costs you $0.015 per message (at Sonnet input prices) for the prompt alone, before the user even says anything.
Caching: Stop Paying Twice
Prompt caching is one of the most powerful cost-saving features available. The concept is simple: if you're sending the same content repeatedly, cache it so you only pay full price once.
How prompt caching works
The first request that includes a long, stable prefix (such as your system prompt) writes a processed copy of it to a cache. Subsequent requests that begin with the exact same prefix reuse that copy at a steep discount instead of paying to reprocess it from scratch. Change even one character of the prefix, though, and the cache misses.
Caching savings by provider
| Provider | Cache Write Cost | Cache Read Cost | Savings |
|---|---|---|---|
| Anthropic (Claude) | 1.25x base price (once) | 0.1x base price | 90% on reads |
| OpenAI (GPT) | Free (automatic) | 0.5x base price | 50% on reads |
| Google (Gemini) | Free | 0.25x base price | 75% on reads |
What should you cache?
- System prompts — sent with every request, perfect for caching
- Long documents — if users are asking questions about the same document
- Few-shot examples — the examples you include to show the AI what you want
- Conversation history prefix — the earlier parts of a conversation that don't change
A customer support bot with a 3,000-token system prompt handling 1,000 messages/day:
Without caching: $9.00/day in system prompt costs alone
With caching: $0.90/day — saving $243/month from one simple change.
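The support-bot numbers check out with a few lines of arithmetic. This sketch ignores the one-time 1.25x cache-write premium, which is negligible at this volume:

```python
SYSTEM_PROMPT_TOKENS = 3_000
MESSAGES_PER_DAY = 1_000
INPUT_PRICE = 3.00           # $/MTok, Claude Sonnet input
CACHE_READ_MULTIPLIER = 0.1  # cached reads cost 10% of the base input price

tokens_per_day = SYSTEM_PROMPT_TOKENS * MESSAGES_PER_DAY
without_caching = tokens_per_day / 1_000_000 * INPUT_PRICE  # $9.00/day
with_caching = without_caching * CACHE_READ_MULTIPLIER      # $0.90/day
monthly_savings = (without_caching - with_caching) * 30     # $243.00/month
print(f"${without_caching:.2f}/day -> ${with_caching:.2f}/day, "
      f"saving ${monthly_savings:.2f}/month")
```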
Setting Limits & Budgets
Every AI provider gives you tools to set spending limits. Use them. An uncapped API key is a ticking time bomb.
Limits you should set immediately
| Limit Type | What It Does | Where to Set It |
|---|---|---|
| Monthly budget cap | Hard stop when you hit $X/month | Provider dashboard |
| Rate limit | Max requests per minute/hour | Provider dashboard or your code |
| Max tokens per request | Limit how long the AI's response can be | API parameter: max_tokens |
| Max conversation length | Cap how many messages before reset | Your application code |
| Per-user daily limit | Prevent one user from burning your budget | Your application code |
| Alert thresholds | Email you when spending hits 50%, 80% | Provider dashboard |
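The per-user daily limit from the table lives in your application code. A minimal in-memory sketch; a real service would persist the counts in Redis or a database and reset them daily:

```python
from collections import defaultdict

class DailyTokenBudget:
    """Track per-user token usage and refuse requests over a daily cap.
    In-memory only; a real deployment would persist and reset this daily."""

    def __init__(self, daily_cap: int = 100_000):
        self.daily_cap = daily_cap
        self.used = defaultdict(int)

    def try_spend(self, user_id: str, tokens: int) -> bool:
        if self.used[user_id] + tokens > self.daily_cap:
            return False  # over budget: reject before calling the API
        self.used[user_id] += tokens
        return True

budget = DailyTokenBudget(daily_cap=10_000)
print(budget.try_spend("alice", 8_000))  # True: within budget
print(budget.try_spend("alice", 5_000))  # False: would exceed the 10,000 cap
print(budget.try_spend("bob", 5_000))    # True: separate user, separate budget
```

The key design choice is rejecting *before* the API call is made: once the request goes out, the tokens are already billed.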
A developer left an API key in a public repo. A bot found it and ran thousands of requests. The bill: $14,000 in one weekend. Always set budget caps, always rotate exposed keys, and always monitor usage.
Images, Files & Hidden Token Costs
Text isn't the only thing that costs tokens. Many people are shocked by how many tokens images and files consume.
Image token costs
| Image Size | Approximate Tokens | Cost (Sonnet) |
|---|---|---|
| Small thumbnail (100x100) | ~200 tokens | $0.0006 |
| Medium image (500x500) | ~1,000 tokens | $0.003 |
| Large image (1000x1000) | ~1,600 tokens | $0.005 |
| High-res photo (2000x2000) | ~3,200+ tokens | $0.01+ |
| Screenshot (1920x1080) | ~2,500 tokens | $0.008 |
Other hidden token costs
- PDF files: Converted to text, can be thousands of tokens per page
- Code files: Code is token-heavy because of syntax, indentation, and special characters
- JSON/XML: All those brackets, keys, and formatting? Tokens. A 1KB JSON blob can be 300+ tokens
- Tool/function definitions: If you're using tool use or function calling, those schemas are sent as tokens every time
Sending 5 screenshots in one message could cost 12,000+ input tokens, which is more text than a 15-page document. Resize images before sending them to the AI, or describe what's in the image instead.
Streaming vs Batch
Streaming
Streaming shows the AI's response word-by-word as it's generated (like how ChatGPT types in real time). Streaming costs the same in tokens — it doesn't save money. But it feels faster to users because they see output immediately.
Batch processing
If you have hundreds or thousands of requests that don't need instant answers, batch APIs offer 50% discounts.
| Use Case | Best Approach | Why |
|---|---|---|
| Live chatbot | Streaming | Users need instant responses |
| Processing 1,000 documents | Batch | 50% cost savings, no rush |
| Nightly report generation | Batch | Save money, run overnight |
| Interactive code assistant | Streaming | Developers want real-time output |
Anthropic's Batch API gives you 50% off and processes within 24 hours. If your workload can wait, this is free money.
The Gotchas Nobody Tells You
These are the things that catch people off guard. Bookmark this section.
1. Retries multiply your cost
If your code automatically retries failed requests, you're paying for every attempt. Three retries = 3x the cost for one answer. Always implement exponential backoff and set a retry limit.
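A sketch of exponential backoff with a hard retry cap, so a flaky endpoint can cost you at most `max_retries + 1` billed attempts. The `call` argument stands in for your actual API request:

```python
import time

def call_with_backoff(call, max_retries: int = 3, base_delay: float = 1.0):
    """Retry a failing call at most max_retries times, doubling the wait
    each attempt. Every retry is a fully billed request, so the cap matters."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the error instead of spending more
            time.sleep(base_delay * (2 ** attempt))

# Simulated flaky endpoint: fails twice, then succeeds.
attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(call_with_backoff(flaky, base_delay=0.01))  # "ok", after 2 billed failures
print(len(attempts))  # 3 attempts total, each one billed
```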
2. "Temperature" doesn't affect cost, but it affects waste
Higher temperature = more creative but sometimes nonsensical responses. If the AI gives a bad answer and the user has to ask again, you just paid double.
3. Empty or error responses still cost tokens
If the AI returns an error or a useless response, you still paid for the input tokens. Validate inputs before sending them.
4. Thinking tokens (extended thinking / chain-of-thought)
Some models now support "thinking" or "reasoning" modes where the AI works through a problem step by step. Those thinking tokens count as output tokens — the most expensive kind. A model "thinking" for 5,000 tokens before giving a 200-token answer means you're paying for 5,200 output tokens.
5. Conversation forking multiplies costs
If a user edits an earlier message (like in ChatGPT or Claude), the AI re-processes everything from that point forward. That's a whole new conversation branch, paid in full.
6. Tool use / function calling adds tokens
Every tool definition you give the AI is sent as tokens. 10 tools with complex schemas can add 2,000-5,000 tokens to every request — before the user even says anything.
7. The "helpful" AI problem
AI models love to be thorough. Ask a yes/no question, get a 500-word essay. That's 500 output tokens you didn't need. Be specific in your prompts: "Answer with only yes or no."
You can't un-send tokens. Once the API call is made, you're charged — even if you cancel the stream mid-response, even if the answer is wrong, even if your app crashes before showing it to the user. Design defensively.
Token Counting Tools
You don't have to guess how many tokens something is. Use these tools:
| Tool | Works With | Type |
|---|---|---|
| Anthropic Token Counter (API) | Claude models | API endpoint |
| OpenAI Tokenizer (tiktoken) | GPT models | Python library / web tool |
| Anthropic Console | Claude | Usage dashboard |
| OpenAI Usage Dashboard | GPT models | Web dashboard |
| LLM Price Check (llm-price.com) | All models | Price comparison website |
Check your provider's usage dashboard weekly. Set up email alerts at 50% and 80% of your budget. Surprises are expensive in the token world.
Cost Estimation Cheat Sheet
Quick reference for estimating costs before you build:
| Content Type | Approx. Tokens | Real-World Example |
|---|---|---|
| A tweet (280 chars) | ~50 tokens | Quick classification or sentiment |
| A paragraph | ~100-150 tokens | Summary request |
| An email | ~200-500 tokens | Draft or reply generation |
| A full page of text | ~500-700 tokens | Document analysis |
| A blog post | ~1,000-3,000 tokens | Content generation |
| A code file (200 lines) | ~1,500-2,500 tokens | Code review or debugging |
| A 10-page PDF | ~5,000-7,000 tokens | Document Q&A |
| A book chapter | ~10,000-15,000 tokens | Long-form analysis |
Rule of thumb: Take your word count, multiply by 1.3, and you have a rough token estimate. For code, multiply by 1.5-2x because of syntax characters.
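The rule of thumb as code, using the document's multipliers (1.3 for prose, 1.75 as the midpoint of the 1.5-2x code range). Both are rough heuristics, not tokenizer output:

```python
def estimate_tokens(text: str, is_code: bool = False) -> int:
    """Rough token estimate: word count x 1.3 for prose, x 1.75 for code.
    For exact counts, use your provider's tokenizer or token-counting API."""
    words = len(text.split())
    multiplier = 1.75 if is_code else 1.3
    return round(words * multiplier)

print(estimate_tokens("The quick brown fox jumps over the lazy dog"))  # 12
```

Good enough for budgeting; use the tools in the previous section when you need exact numbers.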
Provider Comparison
Each provider does things slightly differently. Here's what matters:
| Feature | Anthropic (Claude) | OpenAI (GPT) | Google (Gemini) |
|---|---|---|---|
| Top model context | 200K tokens | 128K tokens | 2M tokens |
| Prompt caching | 90% savings | 50% savings | 75% savings |
| Batch discount | 50% off | 50% off | Varies |
| Budget caps | Yes (dashboard) | Yes (dashboard) | Yes (dashboard) |
| Free tier | Limited | Limited | Generous |
| Cheapest model | Haiku ($0.80/MTok in) | GPT-4o mini ($0.15/MTok in) | Flash ($0.075/MTok in) |
No single provider is cheapest for everything. Google Gemini has the largest context window and cheapest small models. Anthropic has the best caching savings. OpenAI has the broadest ecosystem. Pick based on your specific use case.
Real-World Scenarios
Scenario 1: "I just want to build a chatbot for my small business"
You build a customer support chatbot. 50 customers/day, average 8 messages each.
Scenario 2: "I'm using AI to process my company's documents"
You upload 500 documents (average 5 pages each) for analysis.
Scenario 3: "My dev team uses AI coding assistants all day"
5 developers, each making ~100 AI requests per day with code context.
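A sketch that puts rough numbers on scenarios like these. Every per-request token count below is an assumption chosen for illustration; plug in your own measurements and the prices from the tables above:

```python
def monthly_cost(requests_per_day: int, input_tokens: int, output_tokens: int,
                 in_price: float = 3.00, out_price: float = 15.00,
                 days: int = 30) -> float:
    """Estimated monthly cost in dollars, defaulting to Claude Sonnet prices."""
    per_request = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    return per_request * requests_per_day * days

# Assumed workloads (illustrative token counts, not measurements):
chatbot = monthly_cost(50 * 8, input_tokens=1_500, output_tokens=300)  # support bot
docs = monthly_cost(500 // 30, input_tokens=4_000, output_tokens=500)  # doc pipeline
devs = monthly_cost(5 * 100, input_tokens=3_000, output_tokens=800)    # coding assist
for name, cost in [("chatbot", chatbot), ("documents", docs), ("dev team", devs)]:
    print(f"{name}: ~${cost:,.0f}/month")
```

Under these assumptions the chatbot lands around $108/month and the dev team around $315/month, which is exactly the range where routing simple requests to a cheaper model starts paying for itself.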