🪙

TOKEN MANAGEMENT PLAYBOOK

Understanding tokens, costs, caching, and limits across LLMs. Built for people building with AI who want to stop bleeding money they didn't know they were spending.

16 Sections
10 Golden Rules
Feb 2026
Boots Approved
1. What Is a Token?

A token is a chunk of text that an AI model processes. It's not a character, and it's not a full word. It's somewhere in between.

Think of it like this: the AI doesn't read words the way you do. It breaks everything into small pieces called tokens, then processes those pieces.

How big is a token?

// How the AI sees your text:
"Hello, how are you today?"

Token 1: "Hello"
Token 2: ","
Token 3: " how"
Token 4: " are"
Token 5: " you"
Token 6: " today"
Token 7: "?"

// 5 words = 7 tokens (simple sentence)
// But "uncomfortable" = 3 tokens: "un" + "comfort" + "able"
💡 Key Insight

Simple, common words use fewer tokens. Technical jargon, code, and non-English text use more. The word "API" is 1 token, but "implementation" might be 2-3 tokens.

2. Tokens = Money

Every time you send a message to an AI model through an API, you're paying for tokens. Every. Single. Time.

AI providers charge per token — usually priced per 1 million tokens (MTok). The formula is dead simple:

Your Cost = (Input Tokens x Input Price) + (Output Tokens x Output Price)

Example with Claude Sonnet:
  You send:   2,000 tokens x $3.00/MTok  = $0.006
  AI replies: 1,000 tokens x $15.00/MTok = $0.015
                                           ─────────────
                                    Total: $0.021 per exchange

That seems tiny. But multiply it:

Scenario | Exchanges/Day | Daily Cost | Monthly Cost
---------|---------------|------------|-------------
Light personal use | 50 | $1.05 | $31.50
Dev team (5 people) | 500 | $10.50 | $315
Customer-facing chatbot | 5,000 | $105 | $3,150
Heavy production app | 50,000 | $1,050 | $31,500
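The cost formula above is easy to wire into code. A minimal sketch, with prices hard-coded to the Sonnet figures used in this section (substitute your own):

```javascript
// Estimate the cost of one API exchange. Prices are per million tokens (MTok).
function exchangeCost(inputTokens, outputTokens, inputPerMTok, outputPerMTok) {
  return (inputTokens / 1e6) * inputPerMTok + (outputTokens / 1e6) * outputPerMTok;
}

const perExchange = exchangeCost(2000, 1000, 3.0, 15.0); // Sonnet pricing
const monthly = perExchange * 5000 * 30;                 // chatbot at 5,000 exchanges/day
console.log(perExchange.toFixed(3)); // 0.021
console.log(monthly.toFixed(0));     // 3150
```

Run this before you build, not after the first bill arrives.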
🚨 Wake-Up Call

A single poorly-designed AI chatbot can burn through $1,000+ per month without you realizing it. Most people don't check their API usage until they get the bill.

3. Input vs Output Tokens

This is a detail most people miss: input tokens and output tokens are priced differently.

Output tokens almost always cost 3-5x more than input tokens.

Model | Input Price (per MTok) | Output Price (per MTok) | Output Multiplier
------|------------------------|-------------------------|------------------
GPT-4o | $2.50 | $10.00 | 4x more
Claude Sonnet 4.5 | $3.00 | $15.00 | 5x more
Claude Opus 4 | $15.00 | $75.00 | 5x more
Claude Haiku 3.5 | $0.80 | $4.00 | 5x more
GPT-4o mini | $0.15 | $0.60 | 4x more
✅ Pro Tip

If your AI is writing long, verbose responses and you only need a short answer — you're overpaying on output tokens. Tell the model to be concise. A simple instruction like "Answer in 2-3 sentences" can cut your output costs by 80%.

4. The Context Window

The context window is the total amount of text (in tokens) that a model can "see" at one time. Think of it as the AI's working memory.

Model | Context Window | Roughly Equivalent To
------|----------------|----------------------
GPT-4o | 128K tokens | ~200 pages / a short novel
Claude Sonnet 4.5 | 200K tokens | ~300 pages / a full novel
Gemini 1.5 Pro | 2M tokens | ~3,000 pages / several textbooks

What happens when you hit the limit?

The AI doesn't crash — it just starts forgetting the oldest parts of the conversation. This is called "falling out of context." The AI silently drops your earlier messages to make room for new ones.

⚠️ Warning

A 200K context window doesn't mean you should USE all 200K tokens. A bigger context means a bigger bill. If you stuff 100K tokens of context into every request, you're paying for 100K input tokens every single time you send a message.

The trap: Just because a model CAN handle 200K tokens doesn't mean it SHOULD. Performance degrades on very long contexts. The model may "lose focus" on important details buried in the middle of a massive context.

5. Long Conversations: The Hidden Cost

This is the single biggest gotcha for most people. Here's what actually happens in a conversation with an AI:

Message 1:  You send 100 tokens ──> AI reads 100 tokens
Message 2:  You send 100 tokens ──> AI reads 300 tokens (msg 1 + reply + msg 2)
Message 3:  You send 100 tokens ──> AI reads 600 tokens (all history + msg 3)
Message 4:  You send 100 tokens ──> AI reads 1,000 tokens (all history + msg 4)
...
Message 20: You send 100 tokens ──> AI reads 10,000+ tokens (ENTIRE conversation)

Every message re-sends the ENTIRE conversation history. The AI doesn't "remember" — it re-reads everything from scratch each time.

The snowball effect

A 20-message conversation doesn't cost 20x a single message. It costs closer to 200x because each message includes all previous messages as input.
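Under the same assumption of ~100-token messages and ~100-token replies, the snowball is easy to compute: input grows quadratically with conversation length.

```javascript
// Total input tokens billed across an n-message conversation, assuming each
// user message and each reply is ~perMessage tokens. Every request re-sends
// the full history, so the running total grows as n squared.
function conversationInputTokens(n, perMessage = 100) {
  let total = 0;
  for (let k = 1; k <= n; k++) {
    // request k carries (k-1) earlier messages + (k-1) replies + the new message
    total += (2 * (k - 1) + 1) * perMessage;
  }
  return total;
}

console.log(conversationInputTokens(1));  // 100
console.log(conversationInputTokens(20)); // 40000 (vs 2,000 for 20 standalone messages)
```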

🚨 The #1 Money Pit

A single long conversation with a powerful model can easily cost $1-5+ in tokens. If you have users running long conversations with your AI product, this adds up to thousands per month — fast.

What to do about it

- Start a new conversation when the topic changes instead of continuing one endless thread.
- Summarize long conversations, then reset with the summary as fresh context.
- Cap conversation length in your application code and trim the oldest messages when you hit the cap.
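A minimal sketch of the trimming approach, using a crude chars/4 token estimate (a real app would use the provider's token-counting endpoint instead of this stand-in):

```javascript
// Stand-in token counter: roughly 4 characters per token.
const countTokens = (msg) => Math.ceil(msg.content.length / 4);

// Keep the conversation under a token budget by dropping the oldest messages.
function trimHistory(messages, budget) {
  const kept = [];
  let used = 0;
  // Walk backwards so the most recent messages survive.
  for (let i = messages.length - 1; i >= 0; i--) {
    const cost = countTokens(messages[i]);
    if (used + cost > budget) break;
    kept.unshift(messages[i]);
    used += cost;
  }
  return kept;
}
```

Call `trimHistory(history, budget)` before every API request; pairing it with a summary of the dropped messages preserves long-range context without paying for it on every turn.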

6. Model Selection & Pricing

Not every task needs the most powerful (and expensive) model. Choosing the right model for the job is the easiest way to cut costs.

The model hierarchy

Haiku / Mini    → $
Sonnet / GPT-4o → $$
Opus / o1       → $$$$$

When to use what

Task | Best Model Tier | Why
-----|-----------------|----
Classify text, extract data, simple Q&A | Haiku / Mini | Fast, cheap, good enough
Write content, code, analysis | Sonnet / GPT-4o | Great quality, reasonable cost
Complex reasoning, architecture, research | Opus / o1 | Best quality, premium cost
Summarize text, format data | Haiku / Mini | Don't overpay for simple tasks
✅ Pro Tip

Route by task complexity. In production apps, use a small model to classify the request first, then route complex requests to a powerful model and simple ones to a cheap model. This alone can cut costs 50-70%.
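A sketch of that routing pattern. `callModel` and the model names here are placeholders, not real SDK calls; swap in your provider's client and current model IDs.

```javascript
// Route by complexity: a cheap model classifies the request first, then the
// request goes to whichever tier matches. callModel(modelId, prompt) is an
// assumed helper that returns the model's text response.
async function routeRequest(userMessage, callModel) {
  const verdict = await callModel(
    "small-model",
    `Classify as SIMPLE or COMPLEX, one word only:\n${userMessage}`
  );
  const model = verdict.trim().toUpperCase() === "COMPLEX"
    ? "powerful-model"  // e.g. Sonnet-tier
    : "cheap-model";    // e.g. Haiku-tier
  return callModel(model, userMessage);
}
```

The classification call itself is tiny (a few hundred tokens on the cheapest tier), so it pays for itself as soon as it diverts even a fraction of traffic away from the expensive model.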

7. System Prompts: The Silent Token Eaters

A system prompt is the hidden instruction you give the AI before the user ever types anything. Things like "You are a helpful customer service agent for Acme Corp..."

Here's the problem: the system prompt is sent with EVERY single message.

Every API call includes:

┌─────────────────────────────────┐
│ System Prompt (500 tokens)      │ ← Sent EVERY time
│ Conversation History (varies)   │ ← Growing EVERY time
│ User's New Message (100 tokens) │ ← The actual question
└─────────────────────────────────┘

If your system prompt is 2,000 tokens and a user sends 50 messages:
  System prompt alone = 2,000 x 50 = 100,000 tokens
  That's $0.30 just for the system prompt (Sonnet pricing)

How to fix it

- Cut the system prompt to the essentials; move FAQs and docs into retrieval or cached context.
- Cache the system prompt so repeat sends are billed at the discounted cache-read rate.
- Keep one lean prompt per use case instead of one giant prompt that covers everything.

⚠️ Common Mistake

Pasting your entire company FAQ, product docs, or lengthy persona description into the system prompt. A 5,000-token system prompt costs you $0.015 per message just for the prompt itself — before the user even says anything.

8. Caching: Stop Paying Twice

Prompt caching is one of the most powerful cost-saving features available. The concept is simple: if you're sending the same content repeatedly, cache it so you only pay full price once.

How prompt caching works

Without Caching:
  Request 1: [System Prompt + Context] → Full price
  Request 2: [System Prompt + Context] → Full price (again!)
  Request 3: [System Prompt + Context] → Full price (again!!)

With Caching:
  Request 1: [System Prompt + Context] → Full price (cached)
  Request 2: [Cached ✓] + new message  → 90% discount on cached part
  Request 3: [Cached ✓] + new message  → 90% discount on cached part

Caching savings by provider

Provider | Cache Write Cost | Cache Read Cost | Savings
---------|------------------|-----------------|--------
Anthropic (Claude) | 1.25x base price (once) | 0.1x base price | 90% on reads
OpenAI (GPT) | Free (automatic) | 0.5x base price | 50% on reads
Google (Gemini) | Free | 0.25x base price | 75% on reads

What should you cache?

- System prompts and persona instructions
- Large document context that every request needs
- Few-shot examples and other boilerplate sent more than once

✅ Real Savings Example

A customer support bot with a 3,000-token system prompt handling 1,000 messages/day:
Without caching: $9.00/day in system prompt costs alone
With caching: $0.90/day — saving $243/month from one simple change.
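With the Anthropic SDK, caching that system prompt is a matter of marking the stable block with `cache_control`. This sketch shows the request shape from memory; verify field names and the model ID against the current docs before shipping.

```javascript
// Assumed ~3,000-token support prompt; placeholder text here.
const LONG_SUPPORT_PROMPT = "You are a support agent for Acme Corp...";

const request = {
  model: "claude-sonnet-4-5-20250929",
  max_tokens: 500,
  system: [
    {
      type: "text",
      text: LONG_SUPPORT_PROMPT,
      cache_control: { type: "ephemeral" }, // cache this block across requests
    },
  ],
  messages: [{ role: "user", content: "Where is my order?" }],
};
// const response = await anthropic.messages.create(request);
```

The first call pays the 1.25x cache-write price; every call after that reads the prompt at 0.1x, which is where the $243/month above comes from.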

9. Setting Limits & Budgets

Every AI provider gives you tools to set spending limits. Use them. An uncapped API key is a ticking time bomb.

Limits you should set immediately

Limit Type | What It Does | Where to Set It
-----------|--------------|----------------
Monthly budget cap | Hard stop when you hit $X/month | Provider dashboard
Rate limit | Max requests per minute/hour | Provider dashboard or your code
Max tokens per request | Limit how long the AI's response can be | API parameter: max_tokens
Max conversation length | Cap how many messages before reset | Your application code
Per-user daily limit | Prevent one user from burning your budget | Your application code
Alert thresholds | Email you when spending hits 50%, 80% | Provider dashboard
// Example: Setting max_tokens in an API call
// This limits the AI's response length
const response = await anthropic.messages.create({
  model: "claude-sonnet-4-5-20250929",
  max_tokens: 500, // AI can only respond with 500 tokens max
  messages: [{ role: "user", content: userMessage }]
});

// Without a sensible cap, the AI might write 4,000 tokens when 500 would do
// That's 8x the output cost — for nothing
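The per-user daily limit from the table is application code you write yourself. A minimal in-memory sketch (a real service would persist this in Redis or a database, and the 50,000-token cap is an arbitrary example):

```javascript
const DAILY_LIMIT = 50_000; // tokens per user per day (example value)
const usage = new Map();    // key: `${userId}:${date}` -> tokens used

// Returns true and records the spend, or false if it would exceed the cap.
function tryConsume(userId, tokens, now = new Date()) {
  const key = `${userId}:${now.toISOString().slice(0, 10)}`;
  const used = usage.get(key) ?? 0;
  if (used + tokens > DAILY_LIMIT) return false; // reject: over budget
  usage.set(key, used + tokens);
  return true;
}
```

Check `tryConsume` before every API call; one greedy user then degrades gracefully instead of draining the whole budget.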
🚨 Horror Story

A developer left an API key in a public repo. A bot found it and ran thousands of requests. The bill: $14,000 in one weekend. Always set budget caps, always rotate exposed keys, and always monitor usage.

10. Images, Files & Hidden Token Costs

Text isn't the only thing that costs tokens. Many people are shocked by how many tokens images and files consume.

Image token costs

Image Size | Approximate Tokens | Cost (Sonnet)
-----------|--------------------|--------------
Small thumbnail (100x100) | ~200 tokens | $0.0006
Medium image (500x500) | ~1,000 tokens | $0.003
Large image (1000x1000) | ~1,600 tokens | $0.005
High-res photo (2000x2000) | ~3,200+ tokens | $0.01+
Screenshot (1920x1080) | ~2,500 tokens | $0.008

Other hidden token costs

- File uploads (PDFs, spreadsheets): their extracted text counts as input tokens
- Tool and function definitions sent along with each request
- Thinking / reasoning tokens, billed at output rates

⚠️ Watch Out

Sending 5 screenshots in one message could cost 12,000+ input tokens — more than 20 full pages of text. Resize images before sending them to the AI, or describe what's in the image instead.

11. Streaming vs Batch

Streaming

Streaming shows the AI's response word-by-word as it's generated (like how ChatGPT types in real time). Streaming costs the same in tokens — it doesn't save money. But it feels faster to users because they see output immediately.

Batch processing

If you have hundreds or thousands of requests that don't need instant answers, batch APIs offer 50% discounts.

Use Case | Best Approach | Why
---------|---------------|----
Live chatbot | Streaming | Users need instant responses
Processing 1,000 documents | Batch | 50% cost savings, no rush
Nightly report generation | Batch | Save money, run overnight
Interactive code assistant | Streaming | Developers want real-time output
✅ Pro Tip

Anthropic's Batch API gives you 50% off and processes within 24 hours. If your workload can wait, this is free money.
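A sketch of what a batch submission looks like in the Anthropic style: bundle the non-urgent requests and submit them once. The SDK method in the comment and the model ID are from memory; check the current Batch API docs before relying on them.

```javascript
// Placeholder documents to summarize overnight.
const docs = ["Q3 sales report...", "Churn analysis..."];

const batch = {
  requests: docs.map((text, i) => ({
    custom_id: `doc-${i}`, // lets you match results back to inputs later
    params: {
      model: "claude-sonnet-4-5-20250929",
      max_tokens: 500,
      messages: [{ role: "user", content: `Summarize:\n${text}` }],
    },
  })),
};
// const job = await anthropic.messages.batches.create(batch);
// Results arrive within 24 hours at 50% of the normal token price.
```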

12. The Gotchas Nobody Tells You

These are the things that catch people off guard. Bookmark this section.

1. Retries multiply your cost

If your code automatically retries failed requests, you're paying for every attempt. Three retries = 3x the cost for one answer. Always implement exponential backoff and set a retry limit.
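A minimal retry helper along those lines: capped attempts, doubling delays, and an injectable `sleep` so the logic is testable without waiting.

```javascript
// Retry with exponential backoff and a hard cap, so a flaky endpoint can at
// most quadruple your spend instead of multiplying it without bound.
async function withBackoff(fn, maxRetries = 3, baseDelayMs = 1000,
                           sleep = (ms) => new Promise((r) => setTimeout(r, ms))) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err;    // give up: stop paying for retries
      await sleep(baseDelayMs * 2 ** attempt); // 1s, 2s, 4s, ...
    }
  }
}
```

Usage: `await withBackoff(() => anthropic.messages.create(request))` wraps any call; every attempt still bills input tokens, which is exactly why the cap matters.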

2. "Temperature" doesn't affect cost, but it affects waste

Higher temperature = more creative but sometimes nonsensical responses. If the AI gives a bad answer and the user has to ask again, you just paid double.

3. Empty or error responses still cost tokens

If the AI returns an error or a useless response, you still paid for the input tokens. Validate inputs before sending them.

4. Thinking tokens (extended thinking / chain-of-thought)

Some models now support "thinking" or "reasoning" modes where the AI works through a problem step by step. Those thinking tokens count as output tokens — the most expensive kind. A model "thinking" for 5,000 tokens before giving a 200-token answer means you're paying for 5,200 output tokens.

5. Conversation forking multiplies costs

If a user edits an earlier message (like in ChatGPT or Claude), the AI re-processes everything from that point forward. That's a whole new conversation branch, paid in full.

6. Tool use / function calling adds tokens

Every tool definition you give the AI is sent as tokens. 10 tools with complex schemas can add 2,000-5,000 tokens to every request — before the user even says anything.
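A quick way to see what your tool definitions weigh, using the rough 4-characters-per-token heuristic (an estimate only, not the provider's exact count; the tool schema below is a made-up example):

```javascript
// Serialize the tool schemas you send with every request and estimate
// their token weight at ~4 characters per token.
function toolOverheadTokens(tools) {
  return Math.ceil(JSON.stringify(tools).length / 4);
}

const tools = [{
  name: "get_order_status",
  description: "Look up an order by ID and return its shipping status",
  input_schema: { type: "object", properties: { order_id: { type: "string" } } },
}];
console.log(toolOverheadTokens(tools)); // tokens added to EVERY request
```

If a request doesn't need a tool, don't send its definition: trimming the tool list per request is the cheapest optimization in this list.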

7. The "helpful" AI problem

AI models love to be thorough. Ask a yes/no question, get a 500-word essay. That's 500 output tokens you didn't need. Be specific in your prompts: "Answer with only yes or no."

🚨 The Biggest Gotcha

You can't un-send tokens. Once the API call is made, you're charged — even if you cancel the stream mid-response, even if the answer is wrong, even if your app crashes before showing it to the user. Design defensively.

13. Token Counting Tools

You don't have to guess how many tokens something is. Use these tools:

Tool | Works With | Type
-----|------------|-----
Anthropic Token Counter (API) | Claude models | API endpoint
OpenAI Tokenizer (tiktoken) | GPT models | Python library / web tool
Anthropic Console | Claude | Usage dashboard
OpenAI Usage Dashboard | GPT models | Web dashboard
LLM Price Check (llm-price.com) | All models | Price comparison website
✅ Pro Tip

Check your provider's usage dashboard weekly. Set up email alerts at 50% and 80% of your budget. Surprises are expensive in the token world.

14. Cost Estimation Cheat Sheet

Quick reference for estimating costs before you build:

Content Type | Approx. Tokens | Real-World Example
-------------|----------------|-------------------
A tweet (280 chars) | ~50 tokens | Quick classification or sentiment
A paragraph | ~100-150 tokens | Summary request
An email | ~200-500 tokens | Draft or reply generation
A full page of text | ~500-700 tokens | Document analysis
A blog post | ~1,000-3,000 tokens | Content generation
A code file (200 lines) | ~1,500-2,500 tokens | Code review or debugging
A 10-page PDF | ~5,000-7,000 tokens | Document Q&A
A book chapter | ~10,000-15,000 tokens | Long-form analysis
💡 Quick Math

Rule of thumb: Take your word count, multiply by 1.3, and you have a rough token estimate. For code, multiply by 1.5-2x because of syntax characters.
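That rule of thumb as a tiny helper. Estimates only; use your provider's token counter when the number actually matters. The code multiplier uses 1.75, the midpoint of the 1.5-2x range above.

```javascript
// Rough token estimate: words x 1.3 for prose, words x 1.75 for code.
function estimateTokens(text, isCode = false) {
  const words = text.trim().split(/\s+/).filter(Boolean).length;
  return Math.ceil(words * (isCode ? 1.75 : 1.3));
}

console.log(estimateTokens("Hello, how are you today?")); // 7
```

Note the example lands on 7, matching the actual token count from Section 1 — the heuristic is coarse but usable for budgeting.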

15. Provider Comparison

Each provider does things slightly differently. Here's what matters:

Feature | Anthropic (Claude) | OpenAI (GPT) | Google (Gemini)
--------|--------------------|--------------|----------------
Top model context | 200K tokens | 128K tokens | 2M tokens
Prompt caching | 90% savings | 50% savings | 75% savings
Batch discount | 50% off | 50% off | Varies
Budget caps | Yes (dashboard) | Yes (dashboard) | Yes (dashboard)
Free tier | Limited | Limited | Generous
Cheapest model | Haiku ($0.80/MTok in) | GPT-4o mini ($0.15/MTok in) | Flash ($0.075/MTok in)
💡 Key Takeaway

No single provider is cheapest for everything. Google Gemini has the largest context window and cheapest small models. Anthropic has the best caching savings. OpenAI has the broadest ecosystem. Pick based on your specific use case.

16. Real-World Scenarios

Scenario 1: "I just want to build a chatbot for my small business"

You build a customer support chatbot. 50 customers/day, average 8 messages each.

// The math
System prompt: 1,000 tokens
Average conversation length: 8 messages
Average tokens per exchange: ~1,500 tokens (input + output, growing)
Total per conversation: ~12,000 tokens
Daily (50 conversations): 600,000 tokens
Monthly: ~18M tokens

// Cost with Claude Sonnet (blended rate ~$6/MTok)
Monthly cost: ~$108/month

// With caching + Haiku for simple questions (70% of traffic)
Monthly cost: ~$25/month ← Smart routing saves $83/month

Scenario 2: "I'm using AI to process my company's documents"

You upload 500 documents (average 5 pages each) for analysis.

// The math
500 documents x 5 pages x 700 tokens/page = 1,750,000 input tokens
Analysis output per doc (~500 tokens): 250,000 output tokens

// Cost with Claude Sonnet
Input:  1.75M x $3/MTok  = $5.25
Output: 0.25M x $15/MTok = $3.75
Total: $9.00

// With Batch API (50% off)
Total: $4.50

// With Haiku instead (if quality is sufficient)
Input:  1.75M x $0.80/MTok = $1.40
Output: 0.25M x $4/MTok    = $1.00
Total: $2.40

Scenario 3: "My dev team uses AI coding assistants all day"

5 developers, each making ~100 AI requests per day with code context.

// The math
Average request: 3,000 tokens input + 1,500 tokens output
Per developer per day: 100 requests
Team daily tokens: 2.25M tokens (5 devs x 100 requests x 4,500 tokens)
Team monthly tokens: ~67.5M tokens

// Cost with Claude Sonnet
Input:  45M x $3/MTok    = $135
Output: 22.5M x $15/MTok = $337.50
Monthly: ~$472.50/month

// With prompt caching (system prompt + common context)
Monthly: ~$280/month ← Caching saves ~$190/month

🏆 10 Golden Rules of Token Management

1. Set budget caps on day one. Every provider lets you set spending limits. Do it before you write a single line of code. An uncapped API key is an unlimited credit card left on a park bench.

2. Long conversations are expensive conversations. Every message re-sends the entire history. Start new conversations for new topics. Summarize and reset when conversations get long.

3. Use the smallest model that gets the job done. Don't use Opus/GPT-4 for tasks that Haiku/GPT-4o-mini handles fine. Route by complexity — simple tasks to cheap models, hard tasks to powerful ones.

4. Cache everything you repeat. System prompts, document context, examples — if it's sent more than once, cache it. This alone can cut costs 50-90%.

5. Output tokens cost 3-5x more than input. Tell the AI to be concise. Set max_tokens on every request. A shorter response isn't just faster — it's cheaper.

6. Keep system prompts lean. Every token in your system prompt is charged on every single message. Cut the fluff. Be precise. Your wallet will thank you.

7. Monitor usage weekly, not monthly. Check your provider dashboard every week. Set alerts at 50% and 80% of budget. Catching a runaway cost on day 7 is better than finding out on day 30.

8. Images and files are token-heavy. A single screenshot can cost more tokens than a full page of text. Resize images, extract text when possible, and only send what's needed.

9. Use batch processing for non-urgent work. If it can wait hours instead of seconds, use the Batch API for 50% savings. Nightly reports, document processing, data analysis — batch it all.

10. Tokens sent are tokens paid — no refunds. Cancelled streams, failed requests, bad prompts — you pay for all of it. Validate inputs, test prompts thoroughly, and design for efficiency from the start.