If you're using Opus for every task, including formatting JSON and generating boilerplate, you're overpaying: Sonnet handles those tasks with identical quality at a fraction of the cost. Model choice is the single biggest lever for cost at scale. Sonnet is 5x cheaper than Opus per token ($3/$15 per million input/output tokens versus $15/$75). Routing tasks to the right model and effort level can cut your bill by 60-80% without sacrificing quality where it matters.
For budget caps and team spending policies, see Budget Controls. For cache-level optimization, see Cache & Prompt Optimization below.
Model Selection Guide
| Task Type | Model | Why |
|---|---|---|
| Code formatting, linting fixes | claude-sonnet-4-6 | Mechanical transforms; Sonnet handles these equally well |
| Test generation | claude-sonnet-4-6 | Pattern-following work with clear templates |
| Summarization, documentation | claude-sonnet-4-6 | Text synthesis where speed matters more than depth |
| Complex refactoring | claude-opus-4-6 | Cross-file reasoning with architectural implications |
| Bug diagnosis | claude-opus-4-6 | Requires deep reasoning about state and side effects |
| Architecture decisions | claude-opus-4-6 | Trade-off analysis requiring broad context understanding |
Switch models with the --model flag:

```bash
# Sonnet for routine tasks — ~$0.005/call
claude -p "Add JSDoc comments to this file" --model claude-sonnet-4-6

# Opus for complex reasoning — ~$0.016/call minimum
claude -p "Find the race condition in this auth flow" --model claude-opus-4-6
```

Compare model costs on the same task:

```bash
claude -p "Format this JSON: {a:1,b:2}" --output-format json | jq '{model: "opus", cost: .total_cost_usd}'
claude -p "Format this JSON: {a:1,b:2}" --model claude-sonnet-4-6 --output-format json | jq '{model: "sonnet", cost: .total_cost_usd}'
```
Is the Opus output better for this task? Or did you just pay 5x more for the same result?
Two-Pass Strategy
In CI pipelines, default to Sonnet and escalate to Opus only when Sonnet’s output fails validation or quality checks. A two-pass strategy — Sonnet drafts, Opus reviews only when needed — can cut pipeline costs by 60-80%.
```bash
#!/bin/bash
# two-pass.sh — Sonnet first, Opus only if needed

# Pass 1: Sonnet draft
RESULT=$(claude -p "Review this PR for issues" \
  --model claude-sonnet-4-6 \
  --output-format json \
  --max-budget-usd 0.25)

REVIEW=$(echo "$RESULT" | jq -r '.result')

# Check if Sonnet flagged anything complex
if echo "$REVIEW" | grep -qi "complex\|unclear\|needs deeper\|architecture"; then
  # Pass 2: Opus deep review
  RESULT=$(claude -p "Provide a thorough review of this PR, \
focusing on architectural concerns and edge cases" \
    --model claude-opus-4-6 \
    --output-format json \
    --max-budget-usd 1.00)
  REVIEW=$(echo "$RESULT" | jq -r '.result')
fi

echo "$REVIEW"
```

Cost Comparison
Two-Pass vs Single-Pass Cost
| Strategy | 100 PRs/day | Monthly Cost |
|---|---|---|
| All Opus | 100 x $0.05 = $5/day | $110 |
| All Sonnet | 100 x $0.01 = $1/day | $22 |
| Two-pass (20% escalation) | 100 x $0.01 + 20 x $0.05 = $2/day | $44 |
The two-pass strategy costs 60% less than all-Opus while maintaining Opus-quality reviews on the PRs that need them.
In a wrapper script, set the default model based on the task category. Route formatting, linting, and boilerplate generation to Sonnet automatically. Reserve Opus for commands that explicitly request it or for tasks that require multi-file reasoning.
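A minimal sketch of such a wrapper, assuming illustrative task-category names (the categories and the route_model helper are not part of the CLI):

```shell
#!/bin/bash
# route_model: map an (illustrative) task category to a model ID
route_model() {
  case "$1" in
    format|lint|boilerplate|docs) echo "claude-sonnet-4-6" ;;  # mechanical work
    refactor|debug|architecture)  echo "claude-opus-4-6" ;;    # multi-file reasoning
    *)                            echo "claude-sonnet-4-6" ;;  # default to the cheap model
  esac
}

# Forward a prompt to the routed model, e.g.:
#   claude -p "Fix the eslint errors in src/" --model "$(route_model lint)"
```

Defaulting the fall-through case to Sonnet means unknown tasks start cheap; callers opt in to Opus by naming a reasoning-heavy category.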
Effort Levels
The --effort flag controls how much thinking Claude does before responding. Lower effort means fewer output tokens, which directly reduces cost.
Effort Level Comparison
| Effort | Behavior | Best For | Relative Cost |
|---|---|---|---|
| low | Minimal thinking, short responses | Lookups, simple transforms, yes/no questions | ~0.5x |
| medium | Balanced thinking and output | Standard code generation, explanations | ~1x (baseline) |
| high | Extended reasoning, thorough output | Complex debugging, multi-step analysis | ~1.5-2x |
| max | Maximum reasoning depth | Hardest problems, architecture reviews | ~2-3x |
At scale, effort tuning adds up. If 70% of your team’s calls are simple lookups or formatting tasks, routing them through --effort low can cut total output token costs nearly in half for those calls:
```bash
# Low effort for simple tasks
claude -p "What is the return type of this function?" --effort low

# Default effort for standard work
claude -p "Write unit tests for the auth module"

# Max effort for the hardest problems
claude -p "Diagnose why this distributed lock fails under contention" --effort max
```

Effort affects output token count, not input tokens. The system prompt overhead is the same regardless of effort level. For calls dominated by input cost (large context, many files), effort tuning has minimal impact. It matters most for calls that generate long responses.
Model + Effort Matrix
Combining model selection with effort levels gives you a fine-grained cost control matrix:
Model + Effort Cost Matrix
| Configuration | Use Case | Relative Cost |
|---|---|---|
| Sonnet + low effort | Formatting, simple lookups | ~0.1x (cheapest) |
| Sonnet + medium effort | Standard code gen, docs | ~0.2x |
| Opus + low effort | Quick expert opinions | ~0.5x |
| Opus + medium effort | Standard complex tasks | ~1x (baseline) |
| Opus + max effort | Hardest reasoning tasks | ~2-3x (most expensive) |
The cheapest configuration (Sonnet + low effort) runs at roughly 1/20th to 1/30th the cost of the most expensive (Opus + max effort), per the relative costs above. For a team processing 1,000 calls/day, routing 70% of those calls to the cheapest tier can save thousands per month.
Cache & Prompt Optimization
Two back-to-back calls with identical system prompts should share a cache. Often they don't: the second call used a slightly different flag combination, invalidating the cache and paying full input token price twice. Caching is automatic in Claude CLI, but how you structure your workload determines whether you get 90% savings or pay full price on every call. This section covers the caching strategies, batching patterns, and session management techniques that minimize your token costs at scale.
For budget caps and team policies, see Budget Controls.
How Caching Works
Every Claude CLI call includes a system prompt of 30,000-40,000+ tokens (tool definitions, safety rules, CLAUDE.md files). When two calls share the same system prompt, the second call reads from cache at one-tenth the input price instead of paying full price.
Cache pricing for Opus:
- Full input: $15 per million tokens
- Cache write: $18.75 per million tokens (1.25x full price — first call pays this)
- Cache read: $1.50 per million tokens (0.1x full price — subsequent calls)
The savings compound: after the first call writes the cache, every subsequent call with the same system prompt gets a 90% discount on input tokens.
Measure your cache hit rate. Run the same prompt twice:
```bash
claude -p "Explain caching" --output-format json | jq '{cache_read: .usage.cache_read_input_tokens, cache_create: .usage.cache_creation_input_tokens, cost: .total_cost_usd}'
```
Run it again immediately. How much did cache_read_input_tokens increase on the second call? That increase represents tokens you’re reading at 90% off.
Batch Similar Prompts
Calls that share the same system prompt benefit from cache reads. Run similar tasks together rather than interleaving different workloads:
```bash
# Good: batch similar tasks — second call gets cache hits
claude -p "Review src/auth.ts for security issues" --output-format json
claude -p "Review src/auth.ts for performance issues" --output-format json

# Wasteful: interleave different system contexts
claude -p "Review src/auth.ts" --output-format json
claude -p "Format this JSON" --model claude-sonnet-4-6 --output-format json
claude -p "Review src/db.ts" --output-format json
```

The interleaved pattern forces cache eviction between calls. Grouping similar tasks keeps the cache hot.
Batch Processing Script
For large-scale batch operations, process all files of the same type together:
```bash
#!/bin/bash
# batch-review.sh — process all files in a single batch for cache efficiency

BATCH_DIR="src/api"
BUDGET_PER_FILE=0.25
TOTAL_COST=0

echo "Reviewing all TypeScript files in $BATCH_DIR..."

for file in "$BATCH_DIR"/*.ts; do
  RESULT=$(claude -p "Review $file for security issues" \
    --output-format json \
    --max-budget-usd "$BUDGET_PER_FILE" \
    --no-session-persistence \
    --permission-mode bypassPermissions)

  COST=$(echo "$RESULT" | jq -r '.total_cost_usd // 0')
  CACHE_READ=$(echo "$RESULT" | jq -r '.usage.cache_read_input_tokens // 0')
  CACHE_WRITE=$(echo "$RESULT" | jq -r '.usage.cache_creation_input_tokens // 0')

  echo "$file: \$$COST (cache read: $CACHE_READ, cache write: $CACHE_WRITE)"
  TOTAL_COST=$(echo "$TOTAL_COST + $COST" | bc)
done

echo "Total batch cost: \$$TOTAL_COST"
```

The first file in the batch pays the cache write cost. Subsequent files get cache reads at 10% of the price.
Maintain Sessions
Resuming a session is cheaper than starting fresh. Every resumed turn reads the accumulated context from cache at the 0.1x rate instead of re-ingesting it:
```bash
# Start a session
RESULT=$(claude -p "Analyze the codebase structure" --output-format json)
SESSION=$(echo "$RESULT" | jq -r '.session_id')

# Resume — context is cached, subsequent turns are cheaper
claude -p "Now focus on the API layer" --resume "$SESSION" --output-format json
claude -p "Suggest improvements" --resume "$SESSION" --output-format json
```

When to Resume vs. Start Fresh
Resume vs. Fresh Session
| Scenario | Best Approach | Why |
|---|---|---|
| Follow-up questions on same topic | Resume | Cached context saves 90% on input tokens |
| Same task, different file | New session | Old context is irrelevant noise — wastes tokens |
| Long-running analysis (10+ turns) | Resume with budget | Accumulated context grows — set budget to cap total |
| Parallel independent tasks | Separate sessions | No shared context — each session runs independently |
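If you script session management, the table's decision rule can live in one small helper (the scenario labels are illustrative):

```shell
#!/bin/bash
# session_policy: mirror the resume-vs-fresh table as a function
session_policy() {
  case "$1" in
    followup|long-analysis) echo "resume" ;;  # shared context reads from cache
    new-file|parallel)      echo "new" ;;     # old context is irrelevant noise
    *)                      echo "new" ;;     # default to a clean session
  esac
}

# Choose the flag at call time, e.g.:
#   [ "$(session_policy followup)" = "resume" ] && RESUME_FLAGS="--resume $SESSION"
```

Encoding the policy once keeps individual scripts from accumulating ad-hoc resume logic.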
The Cache Math
Understanding the numbers helps you make better decisions about session management:
Cache Impact on a 10-Turn Opus Session
| Scenario | Calculation | Cost |
|---|---|---|
| Fresh call each time | 10 x 15K input tokens at $15/M | $2.25 |
| All cached | 1 cache write ($0.28) + 9 cache reads ($0.20) | $0.48 |
| Savings | | ~79% |
For a team of 10 developers each running 50 calls per day, the difference between cached and uncached workloads is roughly $240/day — over $5,000/month in savings from session management alone.
Disable Tools When Not Needed
Tool descriptions inflate the system prompt. For pure text tasks, stripping them out reduces input tokens:
```bash
claude -p "Summarize this document" --tools ""
```

This removes tool definitions from the system prompt, trimming the per-call input token count. The savings are modest per call but compound across thousands of daily invocations.
Cache TTL Awareness
Cache entries have TTLs — 5 minutes for short-lived caches, 1 hour for extended caches. If you batch tasks with long gaps between them, the cache may expire and you pay full write cost again. Keep batch runs tight to maximize cache hits.
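A batch driver can guard against this by tracking the gap since the last call. A sketch, assuming the 5-minute short TTL (the ttl_expired helper is illustrative; in a real script you would record `date +%s` after each call):

```shell
#!/bin/bash
# ttl-guard.sh: flag when the gap since the last call exceeds the cache TTL
TTL_SECONDS=300   # short-lived cache TTL (5 minutes)

# Returns success (0) when the cache has likely expired
ttl_expired() {
  local last_call=$1 now=$2
  [ $((now - last_call)) -gt "$TTL_SECONDS" ]
}

if ttl_expired 1000 1400; then echo "cache likely expired, expect a write"; fi  # 400s gap
if ! ttl_expired 1000 1200; then echo "cache still hot"; fi                     # 200s gap
```

When the guard fires, it can be cheaper to reorder the queue and run the remaining similar tasks immediately than to let the cache lapse mid-batch.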
Optimizing for Cache TTL
Structure your CI pipelines to group Claude calls together rather than spreading them across pipeline stages:
```yaml
# Good: Claude calls grouped together (cache stays hot)
jobs:
  claude-review:
    steps:
      - run: claude -p "Lint check" --output-format json
      - run: claude -p "Security check" --output-format json
      - run: claude -p "Doc check" --output-format json

# Wasteful: Claude calls spread across stages (cache expires between)
jobs:
  lint:
    steps:
      - run: eslint .
      - run: claude -p "Lint check" --output-format json  # cache write
  test:
    steps:
      - run: npm test
      - run: claude -p "Security check" --output-format json  # cache expired, write again
```

Cache Monitoring
Check cache utilization in the JSON response to verify your optimization is working:
```bash
RESULT=$(claude -p "Review this file" --output-format json)

# Check cache metrics (default to 0 when a field is absent)
CACHE_READ=$(echo "$RESULT" | jq '.usage.cache_read_input_tokens // 0')
CACHE_WRITE=$(echo "$RESULT" | jq '.usage.cache_creation_input_tokens // 0')
INPUT=$(echo "$RESULT" | jq '.usage.input_tokens // 0')

echo "Cache read: $CACHE_READ tokens"
echo "Cache write: $CACHE_WRITE tokens"
echo "Uncached input: $INPUT tokens"

# Cache hit ratio
TOTAL=$((CACHE_READ + CACHE_WRITE + INPUT))
if [ "$TOTAL" -gt 0 ]; then
  RATIO=$(echo "scale=2; $CACHE_READ * 100 / $TOTAL" | bc)
  echo "Cache hit ratio: ${RATIO}%"
fi
```

A healthy production workload should show an 80%+ cache hit ratio on input tokens. If you see high cache_creation_input_tokens on most calls, your batching strategy needs adjustment.
Next Steps
With both model routing and caching optimized, you have the key cost levers in place: Budget Controls limit spending, model optimization reduces per-call cost, and cache optimization reduces per-token cost. For the full cost picture including team benchmarks and startup overhead, see Cost at Scale.
Run claude -p "Hello" --output-format json | jq '.usage' twice in a row. Compare cache_read_input_tokens between the two responses; the second call should show significantly more cache reads. That difference is the 90% savings kicking in. Structure your workloads to maximize this.