Context Management — Agent Mastered

Your 50-turn session suddenly produces garbage output. Claude isn’t broken — it hit the context window limit and started compacting older messages to make room. The context you thought was permanent just got summarized into a few sentences.

Opus 4.6 and Sonnet 4.6 have a 1M token context window. That sounds enormous until you realize that your system prompt, CLAUDE.md files, MCP tool descriptions, conversation history, and extended thinking are all competing for the same window. Managing context is about deciding what stays, what gets compressed, and what you pay for every turn.

Effort Levels

The --effort flag controls how deeply Claude reasons before responding. Lower effort means fewer thinking tokens and faster answers, but also less thorough analysis for complex tasks.

Here’s what each effort level actually does:

Effort Levels

Level	Behavior	Best For	Token Impact
`low`	Minimal reasoning, fastest responses	Simple lookups, yes/no questions, classification	Near-zero thinking tokens
`medium`	Balanced reasoning	Moderate complexity, code formatting, summaries	Modest thinking budget
`high`	Full reasoning (default)	Most development tasks, code generation, debugging	10K+ thinking tokens possible
`max`	Maximum reasoning depth	Complex architecture, multi-file refactoring, subtle bugs	20K+ thinking tokens possible

For most work, leave effort at high (the default). Drop to low for quick lookups. Bump to max when you’re debugging something subtle and want Claude to think harder. The difference between low and high barely matters for simple questions but is significant for complex reasoning.

You can set effort via the CLI flag, an environment variable, or in your CLAUDE.md:

# CLI flag
claude -p "Refactor auth.py" --effort medium

# Environment variable
export CLAUDE_EFFORT=medium

# Or in CLAUDE.md
# Settings
effort: medium

For cost-sensitive CI pipelines, combine effort with MAX_THINKING_TOKENS to get full reasoning within a hard ceiling:

MAX_THINKING_TOKENS=8000 claude -p "Refactor auth.py" --effort high

This gives high-quality reasoning but prevents runaway thinking token spend. At Opus pricing ($75/M output tokens), an uncapped complex prompt can burn 20K+ thinking tokens — $1.50 in a single response just for thinking.

▸ Try This

Compare effort levels on the same prompt:

claude -p “Explain what a monad is” —effort low —output-format json | jq ‘{effort: “low”, tokens: .usage.output_tokens, cost: .total_cost_usd}‘

claude -p “Explain what a monad is” —effort high —output-format json | jq ‘{effort: “high”, tokens: .usage.output_tokens, cost: .total_cost_usd}‘

How much do output tokens and cost differ? Is the higher-effort answer actually better for this question?

Context Consumers

Every piece of information in Claude’s context window costs tokens. Use the /context command inside an interactive session to see a live breakdown.

What Fills the Context Window

Consumer	Typical Size	Notes
System prompt	5K-10K tokens	Always present, auto-cached
CLAUDE.md files	1K-10K tokens	All `.claude.md` files in hierarchy are loaded
MCP tool descriptions	10K-50K+ tokens	The silent context killer — 20+ MCP servers can shrink effective context to ~150K
Conversation history	Grows per turn	Largest consumer in long sessions
Extended thinking	10K+ per response	Output tokens, expensive at Opus rates ($75/M)
Tool results	Varies widely	Large file reads can consume 5K+ each

Context Budget Simulator

Where Do Your Tokens Go?

Used: 103,000 / 1,000,000 (10.3%)Healthy

MCP tool descriptions deserve special attention. Each tool description consumes 200-500 tokens. With 80+ tools across multiple MCP servers, you can lose 40K-50K tokens before you even start working. Audit with /context, disable unused servers in .claude/settings.json, and use ToolSearch deferred loading to load tool schemas on demand rather than upfront:

# Trigger deferred loading when MCP tools exceed 5% of context
ENABLE_TOOL_SEARCH=auto:5 claude

The /compact Command

When your context is filling up, the /compact command summarizes the conversation so far and replaces the full history with a compressed version. The key advantage over waiting for auto-compaction is control: you tell Claude what to preserve.

Proactive Compaction

$ /compact Focus on the database schema changes and migration decisions

Compacting conversation with focus: database schema changes and migration decisions...

Conversation compacted. Preserved:

- Schema v3 migration: added users.org_id column (non-nullable, FK to orgs)

- Decided against soft deletes in favor of audit_log table

- Migration rollback strategy: down migration drops column after data copy

Context usage: 18% (was 62%)

You can also embed compaction instructions directly in your CLAUDE.md so they apply automatically every time compaction occurs, whether manual or automatic:

# Compact instructions
When you are using compact, please focus on test output and code changes

Three ways to manage context bloat, each for a different situation:

/compact vs /clear vs --resume

Command	What Happens	Use When
`/compact`	Summarizes conversation, keeps working state	Context filling up, still working on the same task
`/compact [focus]`	Summarizes with emphasis on specified topics	You know which details matter most
`/clear`	Wipes conversation entirely, re-reads CLAUDE.md	Starting a completely new task in same session
`—resume`	Loads a previous session from disk	Continuing work across terminal sessions

Auto-Compaction

Auto-compaction triggers at approximately 80-90% context window utilization. When it fires, Claude summarizes the entire conversation, replaces the full history with that summary, re-reads CLAUDE.md from disk, and condenses tool results to their essentials. You do not control what gets prioritized — the model decides.

The problem with auto-compaction is that it is reactive. By the time it triggers, subtle early details may already be compressed beyond recovery. A design decision rationale from turn 3 might become a single sentence by turn 40. The proactive strategy is to run /compact manually at around 60% utilization with explicit focus instructions, so you decide what survives rather than leaving it to the automatic trigger.

For tasks that require reading many files or producing verbose output, consider delegating to a subagent using the Task tool. Subagents run in their own context windows, keeping the main conversation lean while still doing thorough analysis.

When Sessions Die

Sessions degrade after ~10 turns or ~100K cumulative tokens, depending on complexity. Symptoms: repetitive responses, forgotten context, slower latency. When you see these, run /compact to preserve key decisions and reset the working state, or start a fresh session with claude --resume if you need to carry forward only the essential context.

Cache Economics

Claude Code automatically caches system prompts and CLAUDE.md content. See Chapter 4 for full cost analysis. Here’s how caching applies specifically to long sessions:

Cache Tiers

Cache Type	Duration	Cost Multiplier	Notes
Ephemeral cache write	5 minutes	1.25x base input price	Auto-expires, created on every call
Long-lived cache write	1 hour	2.0x base input price	For stable content like system prompts
Cache read	N/A	0.1x base input price	90% savings on repeated content

The first call in a session writes the system prompt (~14K tokens) to cache at 1.25-2.0x the base rate. Every subsequent call within 5 minutes reads that cached content at 0.1x the base rate. After 5 minutes of inactivity, the ephemeral cache expires and the next call pays the full write cost again.

For a 100-turn Opus session, the savings are dramatic: without caching, 100 turns of 15K input tokens at $15/M costs roughly $22.50 in input tokens alone. With caching, 1 write plus 99 reads comes to about $2.51 — an 89% reduction on input tokens. Real-world total session costs drop from $50-100 to $10-19.

effort low -- Cold Cacheartifacts/06/effort_low.json

2 "type": "result",

3 "subtype": "success",

4 "is_error": false,

5 "duration_ms": 3673,

6 "num_turns": 1,

7 "result": "4",← C

8 "session_id": "b1489c73-aa97-4f05-9f70-287a76a336db",

9 "total_cost_usd": 0.060306,← A

10 "usage": {← B

11 "input_tokens": 3,

12 "cache_creation_input_tokens": 9104,

13 "cache_read_input_tokens": 6532,

14 "output_tokens": 5

15 }

16}

ACache miss: 9104 tokens written to cache at 1.25x rate

BPartial hit: 6532 tokens read from existing cache at 0.1x rate

C$0.060 -- inflated by cold cache, not by effort level

effort high -- Warm Cacheartifacts/06/effort_high.json

2 "type": "result",

3 "subtype": "success",

4 "is_error": false,

5 "duration_ms": 2952,

6 "num_turns": 1,

7 "result": "4",← C

8 "session_id": "28f7bd3a-4f86-4cfa-bb09-de29eda70d97",

9 "total_cost_usd": 0.01591,← A

10 "usage": {← B

11 "input_tokens": 3,

12 "cache_creation_input_tokens": 1383,

13 "cache_read_input_tokens": 14253,

14 "output_tokens": 5

15 }

16}

AWarm cache: only 1383 new tokens written

B14253 tokens read from cache -- 0.1x rate, huge savings

C$0.016 -- 3.8x cheaper, entirely from cache hits, not effort level

These two payloads were identical prompts (“What is 2+2?”) run 30 seconds apart. The effort low run cost 3.8x more than effort high — but not because of effort level. The entire cost difference comes from cache state. The first call had a cache miss (9104 creation tokens), while the second hit a warm cache (14253 read tokens). On subsequent runs with a warm cache, costs converge regardless of effort. For trivial prompts, effort level does not change the answer or the cost.

Gotcha

Auto-compaction fires at 80-90% context utilization and decides on its own what to keep. Subtle details from early in the conversation — design rationales, rejected approaches, edge case discussions — can be compressed to a single sentence or lost entirely. Run /compact proactively at ~60% utilization with explicit focus instructions to control what survives.

Tip

CLAUDE.md always survives compaction. When auto-compaction or /compact runs, CLAUDE.md is re-read from disk rather than being summarized. This means your project instructions, style guides, and custom compaction instructions persist across every compaction cycle. Put your most critical context there rather than relying on conversation history to preserve it.

Gotcha

The —effort low flag does not guarantee cheaper calls. As the payload comparison above shows, cache state has a far bigger impact on cost than effort level for simple prompts. Effort primarily affects reasoning depth and thinking tokens on complex tasks. If you want to minimize cost, focus on cache hit rates: batch related calls together and avoid pausing more than 5 minutes between them.

Context Management Playbook

A practical cadence for keeping context healthy in long sessions:

Proactive /compact at 60% utilization. Do not wait for auto-compaction at 80-90%. By that point, subtle early context (design rationales, rejected approaches, variable naming decisions) may already be degraded. Run /compact with explicit focus instructions: /compact Focus on the API contract changes and migration decisions.

MCP overhead audit. Each MCP server adds 200-500 tokens of tool descriptions to every turn, even when idle. With 20 servers, that is 4,000-10,000 tokens of overhead per turn, silently shrinking your effective context. Run /context periodically to check. Disable servers you are not actively using.

Monorepo CLAUDE.md splitting. A single 47K-word CLAUDE.md dramatically degrades performance. Split into subdirectory-specific files — splitting from 47K to 9K words across multiple files improves response quality and reduces startup overhead.

Delegate to subagents. For tasks requiring reading many files or producing verbose output, use the Agent tool. Subagents run in their own context windows, keeping the main conversation lean while still doing thorough analysis. The result summary returns to the main context, not the full working state.

/clear warning. The /clear command permanently deletes all session history with no recovery mechanism. Unlike /compact, which preserves a summary, /clear wipes everything and re-reads only CLAUDE.md. Use it only when starting a completely new task.

/clear Is Permanent

/clear permanently deletes all conversation history with no undo. Unlike /compact (which preserves a summary), /clear wipes everything. The only things that survive are CLAUDE.md files (re-read from disk) and any files Claude wrote to the filesystem. Use /compact to free context while preserving key decisions.

See This in Action

See how forking sessions (—fork-session) manages context efficiently — each discussion fork inherits the review’s full context without resending it — in Build an MR Reviewer, Part 4: Fork Sessions for Discussions.

→ Now Do This

Run a call with —effort low for a simple question: claude -p “What is 2+2?” —effort low —output-format json | jq ‘.usage’. Compare output_tokens to the same call without the flag. For simple tasks, low effort saves tokens and money with no quality loss.