Tokens and Context Windows: Why Your AI Is Slow, Forgetful, and Wrong

A No-Jargon Guide for Developers New to LLMs

Slick Breakdown, Part 2 | By Guido A. Piccolino Jr.

If your AI feels slow, forgetful, or keeps making things up, it’s probably not the model. It’s the context window. And once you understand how tokens work, you can fix it in minutes.

Here’s the whole picture.

Part 1: What Are Tokens?

Tokens are the smallest units an LLM reads and writes. They’re not words, and they’re not characters. They’re subword chunks produced by a tokenizer (typically byte-pair encoding, or BPE, if you want the name).

The Quick Math

Measure	Approximation
1 token	≈ 4 characters
1 token	≈ ¾ of a word
1,000 words	≈ 1,333 tokens
100 tokens	≈ 75 words
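
The ratios above can be turned into a rough estimator. This is a minimal sketch that assumes ordinary English prose; the function names are illustrative, and real counts come from the tokenizer, not from arithmetic.

```python
import math

CHARS_PER_TOKEN = 4      # rule of thumb from the table above
WORDS_PER_TOKEN = 0.75   # 1 token is about 3/4 of a word


def estimate_tokens_from_chars(text: str) -> int:
    """Rough token estimate from character count (English prose only)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_tokens_from_words(word_count: int) -> int:
    """Rough token estimate from a word count."""
    return round(word_count / WORDS_PER_TOKEN)


print(estimate_tokens_from_words(1000))  # 1333
print(estimate_tokens_from_words(75))    # 100
```

Treat the result as a ballpark figure for planning. Code, JSON, and non-English text will usually run higher than these estimates.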

What Tokenizes Efficiently (and What Doesn’t)

Content Type	Token Efficiency
Plain English prose	Best
Markdown	Good
JSON / GraphQL schemas	Mediocre
Deeply nested code	Worst

Why this matters: A 200-line GraphQL schema can eat more tokens than a 400-word essay. If you’re pasting Amplify custom-queries.ts output into a prompt, you’re paying more than you think.

Part 2: What Is a Context Window?

Think of the context window like your own memory during a conversation.

When you talk to a friend for an hour, you don’t remember every single word. You hold onto the important stuff and let the fluff fall away. That’s how human memory works. And it’s how you should manage your AI chats: keep what matters in the window and drop the rest, because the model won’t do that for you.

The context window is the total token budget for a single conversation turn or API call. It’s shared between what you send in and what the model sends back.

┌──────────────────────────────────────────┐
│           CONTEXT WINDOW (200K)          │
│                                          │
│  ┌──────────────┐  ┌──────────────────┐  │
│  │ INPUT TOKENS │  │  OUTPUT TOKENS   │  │
│  │ (your stuff) │  │  (Claude's reply)│  │
│  │              │  │                  │  │
│  │  • system    │  │  • the answer    │  │
│  │  • prompt    │  │  • code blocks   │  │
│  │  • history   │  │  • explanations  │  │
│  │  • files     │  │                  │  │
│  └──────────────┘  └──────────────────┘  │
│                                          │
│  INPUT + OUTPUT = CANNOT EXCEED WINDOW   │
└──────────────────────────────────────────┘

The Zero-Sum Rule

Every token you spend on input is a token Claude cannot spend on output. If your input consumes 190K of a 200K window, Claude only has 10K left to answer. It may truncate mid-thought.
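
The zero-sum rule is just subtraction. A minimal sketch, with a hypothetical helper name, using the numbers from the paragraph above:

```python
def remaining_output_budget(context_window: int, input_tokens: int) -> int:
    """Tokens left for the reply once the input has been counted."""
    if input_tokens > context_window:
        raise ValueError("input alone exceeds the context window")
    return context_window - input_tokens


# 190K of input in a 200K window leaves 10K for the answer.
print(remaining_output_budget(200_000, 190_000))  # 10000
```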

What Counts as Input?

In a chat environment (claude.ai), input includes:

The entire conversation history (every message you sent and every reply Claude gave)
Any uploaded files (PDFs, images, code)
System-level instructions (hidden from you, added by the platform)
Tool definitions (if using MCP connectors, search, etc.)

In the API, input is:

Your system prompt
The full messages array
Tool and function definitions
Embedded images, PDFs (base64)
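
Both lists include the full history, which means input grows with every turn even when each new message is short. Here is a rough sketch of that growth, assuming a fixed system prompt and made-up per-message token counts (the function name and numbers are illustrative only):

```python
def input_tokens_per_turn(system_tokens, exchanges):
    """Return the input size sent on each turn.

    exchanges: list of (user_tokens, reply_tokens) pairs, in order.
    Each turn resends the system prompt plus every earlier message.
    """
    sizes = []
    history = 0
    for user_tokens, reply_tokens in exchanges:
        # Input for this turn: system prompt + prior history + new user message.
        sizes.append(system_tokens + history + user_tokens)
        # Both the user message and the reply become history for the next turn.
        history += user_tokens + reply_tokens
    return sizes


# Hypothetical chat: 500-token system prompt, three 100-token questions,
# each answered with a 400-token reply.
print(input_tokens_per_turn(500, [(100, 400)] * 3))  # [600, 1100, 1600]
```

The user typed the same 100 tokens each time, but the input nearly tripled by turn three. Long replies cost you twice: once as output, then again as history on every later turn.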

Part 3: The `/count_tokens` Endpoint

Anthropic provides a free API endpoint to count tokens before sending a real request. Use it. It will change how you build prompts.

Endpoint

POST https://api.anthropic.com/v1/messages/count_tokens

Required Headers

x-api-key: $ANTHROPIC_API_KEY
content-type: application/json
anthropic-version: 2023-06-01

Basic Example

curl https://api.anthropic.com/v1/messages/count_tokens \
  --header "x-api-key: $ANTHROPIC_API_KEY" \
  --header "content-type: application/json" \
  --header "anthropic-version: 2023-06-01" \
  --data '{
    "model": "claude-sonnet-4-6",
    "system": "You are an Angular 21 expert.",
    "messages": [{
      "role": "user",
      "content": "Refactor my LeagueService to use Signals."
    }]
  }'

Response:

{ "input_tokens": 28 }
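
Here is the same call in Python, as a minimal sketch using only the standard library. The helper names (build_count_tokens_request, count_tokens) are made up for this example; the URL, headers, and body mirror the curl command above. Anthropic's official SDKs wrap this endpoint too, which is usually the simpler route in real code.

```python
import json
import urllib.request

COUNT_TOKENS_URL = "https://api.anthropic.com/v1/messages/count_tokens"


def build_count_tokens_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build the POST request with the same headers as the curl example."""
    return urllib.request.Request(
        COUNT_TOKENS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "x-api-key": api_key,
            "content-type": "application/json",
            "anthropic-version": "2023-06-01",
        },
        method="POST",
    )


def count_tokens(api_key: str, payload: dict) -> int:
    """Send the request and return the input_tokens field of the response."""
    request = build_count_tokens_request(api_key, payload)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["input_tokens"]


# Same body as the curl example above.
payload = {
    "model": "claude-sonnet-4-6",
    "system": "You are an Angular 21 expert.",
    "messages": [
        {"role": "user", "content": "Refactor my LeagueService to use Signals."}
    ],
}

# To run it for real (needs a valid key and network access):
# import os
# print(count_tokens(os.environ["ANTHROPIC_API_KEY"], payload))
```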

What It Supports

Everything a real /messages call supports:

System prompts
Multi-turn conversation history
Tool and function definitions
Base64 images
Base64 PDFs
Extended thinking turns

Key Details

Free. No token charges. Rate-limited (100 to 8,000 RPM by tier).
Separate limits. Counting doesn’t eat into your production message quota.
Estimate. May differ slightly from actual usage (Anthropic adds system tokens you’re not billed for).
Extended thinking. Thinking blocks from previous assistant turns are ignored and don’t count toward your input.

Before/After Workflow

1. Paste your REAL prompt into a /count_tokens call
2. Note the token count                          → BEFORE
3. Apply compression strategies (see Part 5)
4. Re-run /count_tokens with compressed prompt
5. Note the new count                            → AFTER
6. Calculate: savings = (BEFORE - AFTER) / BEFORE × 100
7. Target: 30 to 40% reduction with no quality loss
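
Step 6 as a small helper. This is a sketch with hypothetical names and example counts; plug in the two numbers you got back from /count_tokens.

```python
def savings_percent(before: int, after: int) -> float:
    """Percentage of tokens saved: (BEFORE - AFTER) / BEFORE x 100."""
    if before <= 0:
        raise ValueError("before must be a positive token count")
    return (before - after) / before * 100


def meets_target(before: int, after: int, target_percent: float = 30.0) -> bool:
    """True when the reduction reaches the lower end of the 30 to 40% target."""
    return savings_percent(before, after) >= target_percent


# Example counts, not real measurements.
print(round(savings_percent(1200, 780), 1))  # 35.0
print(meets_target(1200, 780))               # True
```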

Part 4: `/compact` and `/clear` in Chat

These are slash commands for managing your context window mid-conversation. They are Claude Code commands; in claude.ai, the closest equivalent of /clear is starting a new chat.

`/compact`

What it does: Asks Claude to summarize the conversation so far into a compressed form, then continues from that summary instead of the full raw history.

When to use it:

You’ve been going back and forth for 20+ messages and responses are getting slower or less focused.
You notice Claude “forgetting” things you said earlier (it’s running out of window).
You want to keep working on the same topic but shed the dead weight of resolved sub-threads.

What it trades off: Detail from early messages gets lossy-compressed into a summary. If you need Claude to reference the exact wording of message #3, that precision may be gone after compaction.

Best practice: Use /compact before you hit the wall, not after. If Claude starts giving vague or contradictory answers, you’ve already lost context.

`/clear`

What it does: Wipes the entire conversation history and starts fresh with a completely empty context window.

When to use it:

You’re switching to a completely different topic.
The conversation has gone off the rails and compacting won’t save it.
You want maximum available context for a big new task (like pasting a full schema).

What it trades off: Everything. Claude remembers nothing from before the /clear.

Decision Matrix

Still on the same topic, but conversation is long?    → /compact
Switching topics entirely?                            → /clear
Need to paste something huge (schema, codebase)?      → /clear first
Claude giving confused or contradictory answers?      → try /compact
Compact didn't help?                                  → /clear
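
If it helps to see the matrix as logic, here is one way to read it. This is only a sketch: the function and its flags are hypothetical, and the ordering of the checks is my interpretation of the rows above.

```python
from typing import Optional


def suggest_command(
    *,
    switching_topics: bool = False,
    pasting_large_input: bool = False,
    long_conversation: bool = False,
    confused_answers: bool = False,
    compact_already_tried: bool = False,
) -> Optional[str]:
    """Map the situations in the decision matrix to a command."""
    # New topic or a huge paste: start from an empty window.
    if switching_topics or pasting_large_input:
        return "/clear"
    # Confused answers: try compacting first, clear if that did not help.
    if confused_answers:
        return "/clear" if compact_already_tried else "/compact"
    # Same topic, long thread: shed the dead weight but keep the summary.
    if long_conversation:
        return "/compact"
    # Nothing to do yet.
    return None


print(suggest_command(long_conversation=True))  # /compact
print(suggest_command(confused_answers=True, compact_already_tried=True))  # /clear
```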

Part 5: 10 Strategies for Token Maximization

These strategies apply whether you’re using claude.ai, the API, or Claude Code.

Strategy 1: Front-Load Signal, Kill Ceremony

❌ "Hey Claude, I hope you're doing well. I was wondering if you
    could please take a look at my Angular component and help me
    understand what might be going wrong with the signals..."

✅ "This Angular 21 component's computed signal isn't updating
    when the source signal changes. Here's the code:"

Savings: About 30 tokens per prompt. Sounds small. Multiplied across a 40-message conversation, that’s 1,200 tokens freed.

Strategy 2: Prune Before Pasting

Don’t paste a 200-field GraphQL schema when you’re debugging one resolver.

❌ Paste entire schema.graphql (3,000 tokens)
✅ Paste only the League and Season types (400 tokens)
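
Pruning can be scripted. The sketch below pulls named type blocks out of a schema string with a regular expression. It assumes simple type definitions with no nested braces inside the body, and the function name and sample schema are invented for the example.

```python
import re


def extract_types(schema: str, names: list[str]) -> str:
    """Return only the named `type` blocks from a GraphQL schema string.

    Assumes each type body contains no nested braces.
    """
    blocks = []
    for name in names:
        # Match "type Name ... { ... }" starting at the beginning of a line.
        pattern = r"^type\s+" + re.escape(name) + r"\b[^{]*\{[^}]*\}"
        match = re.search(pattern, schema, flags=re.MULTILINE)
        if match:
            blocks.append(match.group(0))
    return "\n\n".join(blocks)


# Invented sample schema, for illustration only.
schema = """\
type League {
  id: ID!
  name: String!
}

type Season {
  id: ID!
  year: Int!
}

type Player {
  id: ID!
}
"""

print(extract_types(schema, ["League", "Season"]))
```

For anything beyond simple schemas, a real GraphQL parser is the safer choice; the regex is just enough to show the idea.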

Strategy 3: Use Structured Delimiters

XML tags are usually shorter than the prose lead-ins they replace, and they give Claude unambiguous boundaries between the pieces of your prompt.

❌ "Here is the error message I'm seeing: ..."
   "And here is the code that produces it: ..."

✅ <error>TypeError: Cannot read property...</error>
   <code>export class LeagueService { ... }</code>
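
If you build prompts in code, a tiny helper keeps the tags consistent. A minimal sketch; the helper names and tag names are arbitrary choices for this example, not anything Claude requires.

```python
def wrap(tag: str, content: str) -> str:
    """Wrap content in a matching pair of XML-style tags."""
    return f"<{tag}>\n{content.strip()}\n</{tag}>"


def build_prompt(question: str, **sections: str) -> str:
    """Put the question first, then each section in its own tagged block."""
    parts = [question.strip()]
    parts += [wrap(tag, body) for tag, body in sections.items()]
    return "\n\n".join(parts)


prompt = build_prompt(
    "Why does this throw?",
    error="TypeError: Cannot read property...",
    code="export class LeagueService { ... }",
)
print(prompt)
```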

Strategy 4: Reference, Don’t Repeat

If you’ve already described a pattern, refer to it by name.

❌ Re-pasting the SCRAPE → CLONE → INJECT → REBRAND pipeline
   description for the 3rd time (200+ tokens)

✅ "Apply the same SCRAPE → CLONE → INJECT → REBRAND pattern
    from the deep-queries script." (roughly 25 tokens)

Strategy 5: Shape the Output

Tell Claude exactly what format you want. This saves output tokens (which also count against the window in multi-turn chat).

❌ "Can you help me fix this function?"
   → Claude writes 300 words of explanation + the fix

✅ "Fix this function. Return only the corrected code, no explanation."
   → Claude returns just the code (~80% fewer output tokens)

Strategy 6: Use `/compact` Strategically

Don’t wait until Claude forgets things. Plan your compaction:

Message 1-10:  Initial architecture discussion
Message 11:    /compact ← compress the "why", keep the "what"
Message 12-20: Implementation details
Message 21:    /compact ← compress again before the next phase
Message 22+:   Testing and refinement

Strategy 7: Start New Chats for New Topics

Each topic deserves a clean context window. Don’t discuss your Amplify auth module in the same thread where you were debugging CSS grid layouts.

Chat 1: "Amplify Gen 2 auth module"      → full 200K for auth
Chat 2: "CSS custom properties theming"  → full 200K for styling

Strategy 8: Batch Related Questions

Instead of 3 separate back-and-forth exchanges:

❌ "What does this error mean?"
   [Claude answers]
   "How do I fix it?"
   [Claude answers]
   "Will this break anything else?"
   [Claude answers]
   [3 exchanges × 2 = 6 messages eating context]

✅ "This error appeared after I changed X.
    1. What causes it?
    2. What's the fix?
    3. Any side effects I should watch for?"
   [1 exchange instead of 3]

Strategy 9: Audit Your System Prompt (API Users)

Your system prompt is sent with every single request, so every token in it is multiplied by the number of calls you make. Move one-time context into the user message instead.

❌ System prompt: 2,000 tokens (includes project history,
   coding standards, personal preferences, examples)
   × 50 API calls = 100,000 input tokens spent on the system prompt alone

✅ System prompt: 400 tokens (core persona + constraints only)
   × 50 API calls = 20,000 input tokens
   User message: include project-specific context only when needed
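
The multiplication is worth running on your own numbers. A quick sketch with a hypothetical helper, using the figures from the example above:

```python
def system_prompt_overhead(system_tokens: int, calls: int) -> int:
    """Total input tokens spent resending the system prompt."""
    return system_tokens * calls


bloated = system_prompt_overhead(2_000, 50)
lean = system_prompt_overhead(400, 50)

print(bloated)         # 100000
print(lean)            # 20000
print(bloated - lean)  # 80000 tokens saved across 50 calls
```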

Strategy 10: Know Your Ceiling

Context window:    200,000 tokens
Your input:       -180,000 tokens
                  ─────────────────
Available output:    20,000 tokens  ← Claude's entire answer budget

If your input is too large, Claude's reply may be cut off mid-thought.
Always leave breathing room.

Part 6: Quick Reference Card

Command / Concept	What It Does
Token	Smallest unit the LLM processes (about 4 chars)
Context window	Total token budget (input + output)
`/count_tokens`	Free API endpoint to measure input tokens
Online token counter	https://lunary.ai/anthropic-tokenizer
`/compact`	Compresses chat history to free up context
`/clear`	Wipes chat history completely
System prompt	Fixed-cost tokens sent with every API request
`max_tokens`	Caps output length (but real ceiling = window minus input)

Part 7: The One Rule to Remember

Every token has a cost, either in money (API) or in quality (chat). The less you waste on input, the more Claude can give you in output.

Treat your context window like RAM. You wouldn’t load your entire codebase into memory to debug one function. Don’t load your entire project history into a prompt to fix one bug.

This is Part 2 of the Slick Breakdown series, where I break down the tech behind AI in plain English, backed by real engineering experience. Next up on TikTok and Reels: turning a bad prompt into a great one, 60 seconds, before and after. Follow along so you don’t miss it.