Prompt Caching Economics: Cut LLM API Costs 90% With Intelligent Cache Architecture
Sending 50,000 tokens of context with every API call costs the same whether those tokens were sent five minutes ago or five seconds ago. For most applications (agents with large system prompts, RAG pipelines, document Q&A, multi-turn conversations), the majority of the prompt is identical across calls. You are paying to prefill the same tokens repeatedly.
Prompt caching changes this. When a provider's inference server has already computed the key-value (KV) cache for a sequence of tokens, it can reuse that computation instead of redoing it. Anthropic charges 90% less for cache-read tokens than for normal input tokens. For a 100,000-token cached prompt, this collapses latency from 11.5 seconds to 2.4 seconds and slashes input token costs by the same proportion.
The catch is that caching requires deliberate prompt architecture. You cannot cache tokens that change between calls. The economics only work when prompts are structured so that stable content sits at the top and dynamic content at the bottom. This conflicts with how most developers write prompts.
This article covers the full economics of prompt caching: how KV cache reuse works technically, what Anthropic, OpenAI, and Gemini charge for it, the break-even formula for any workload, how to structure prompts for maximum cache utilization, and how to wire caching into agent systems and RAG pipelines.
How Prompt Caching Works: KV Cache Reuse Under the Hood
Prompt caching is, mechanically, KV cache reuse across separate API requests. During a transformer's prefill pass, each token in the input attends to all prior tokens by computing query, key, and value projections. The key and value tensors are stored in a KV cache so the model does not recompute them during generation. For a standard single request, this cache is discarded after generation completes.
With prompt caching, the provider stores the KV cache on its servers and makes it available to subsequent requests from the same account. When you send a new request that begins with the same token sequence, the prefill pass skips directly to the first token that differs. You pay compute costs only for the uncached suffix.
The memory cost is significant. At float16, the KV cache for a single layer at a single position costs 2 × H_kv × D_head × 2 bytes (K and V, each H_kv × D_head values at 2 bytes). For an illustrative 32-layer model with 8 KV heads of dimension 128 (Claude's actual architecture is not public), a 100K-token prompt occupies roughly 13 GB of KV cache. Providers amortize this by setting time-to-live (TTL) limits and charging a cache-write premium to cover the storage cost.
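For intuition, the per-token size multiplies out as follows; the layer count and head geometry below are illustrative placeholders, not Claude's real configuration:

```python
def kv_cache_bytes(
    num_tokens: int,
    num_layers: int,
    kv_heads: int,
    head_dim: int,
    bytes_per_param: int = 2,  # float16
) -> int:
    # Each position stores a K and a V vector (the leading 2) of width
    # kv_heads * head_dim, at bytes_per_param bytes, in every layer.
    per_token_per_layer = 2 * kv_heads * head_dim * bytes_per_param
    return num_tokens * num_layers * per_token_per_layer

# Illustrative 32-layer model, 8 KV heads of dimension 128:
size_gb = kv_cache_bytes(100_000, 32, 8, 128) / 1e9
print(f"{size_gb:.1f} GB")  # 13.1 GB
```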
Why the prefix-match constraint matters. KV cache reuse requires that all tokens up to the cache boundary be identical across requests. This is because each key-value pair at position i depends on the full sequence [t_0, t_1, ..., t_i] through the causal attention mask. A single character change at position 100 invalidates every subsequent cached entry. This is the fundamental constraint that shapes cache-optimal prompt architecture.
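A toy sketch of the server-side prefix lookup (not any provider's actual implementation) makes the failure mode concrete:

```python
def cacheable_prefix_len(cached: list[int], incoming: list[int]) -> int:
    """Number of leading tokens the server can reuse from the KV cache."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n

cached_tokens = list(range(1_000))
edited = cached_tokens.copy()
edited[100] = 999_999  # a single token changed at position 100

# Only positions 0-99 are reusable; 100-999 must be re-prefilled.
print(cacheable_prefix_len(cached_tokens, edited))  # 100
```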
Minimum token thresholds. Providers enforce minimum prompt lengths before caching applies. Anthropic requires 1,024 tokens for Claude 3 models and 2,048 tokens for Claude 3.5 Haiku. OpenAI applies automatic caching to prompts of 1,024+ tokens. Google Gemini requires a minimum of 32,768 tokens, an order of magnitude higher, reflecting Gemini's target of very long contexts.
The practical implication: caching delivers its biggest returns on large, repeated contexts. Small system prompts under 1K tokens get no caching benefit at all.
Provider Pricing: Anthropic vs OpenAI vs Gemini
The three major providers take fundamentally different approaches to pricing prompt caching. Understanding the structure of each model is essential before you can calculate whether caching will reduce your costs.
Anthropic: Explicit Cache Control with Write Premium
Anthropic's caching model is explicit: you mark which parts of the prompt to cache using cache_control breakpoints, and you pay a write premium when the cache is first populated, then a deep discount on subsequent reads.
| Token Type | Claude Sonnet 3.5 | Claude 3 Haiku |
|---|---|---|
| Normal input | $3.00 / MTok | $0.25 / MTok |
| Cache write | $3.75 / MTok (+25%) | $0.30 / MTok (+20%) |
| Cache read | $0.30 / MTok (−90%) | $0.03 / MTok (−88%) |
| Output | $15.00 / MTok | $1.25 / MTok |
Cache TTL is 5 minutes, extended by each subsequent cache hit. Up to 4 cache breakpoints are allowed per prompt. Cache usage is reported in the usage object as cache_creation_input_tokens (write) and cache_read_input_tokens (read).
OpenAI: Automatic Caching with No Write Premium
OpenAI's model is automatic and invisible. For any request with 1,024+ input tokens, OpenAI automatically checks for a prefix match in its cache and applies a 50% discount on cached tokens. You do not mark anything explicitly; you just check the cached_tokens field in usage.prompt_tokens_details.
| Token Type | GPT-4o | GPT-4o mini |
|---|---|---|
| Normal input | $2.50 / MTok | $0.15 / MTok |
| Cache read | $1.25 / MTok (−50%) | $0.075 / MTok (−50%) |
| Output | $10.00 / MTok | $0.60 / MTok |
There is no write premium. OpenAI does not publish a fixed TTL; cached prefixes are typically evicted after a few minutes of inactivity. The 50% discount is smaller than Anthropic's 90%, but the automatic model means teams capture cache savings without any prompt restructuring effort.
Google Gemini: Context Caching with Per-Hour Storage Cost
Gemini's "context caching" is a higher-level API that explicitly stores a named cache object, separate from the request itself. You create a cache, then reference it by ID in subsequent requests.
| Parameter | Gemini 1.5 Pro | Gemini 1.5 Flash |
|---|---|---|
| Normal input | $1.25 / MTok | $0.075 / MTok |
| Cache read | $0.3125 / MTok (−75%) | $0.01875 / MTok (−75%) |
| Cache storage | $4.50 / MTok / hour | $1.00 / MTok / hour |
| Minimum tokens | 32,768 | 32,768 |
| Default TTL | 60 minutes | 60 minutes |
The storage cost is unusual. It accrues continuously per hour per cached token. A 100K-token Gemini 1.5 Pro cache costs $0.45/hour to store. This makes Gemini context caching economically sensible only for high-frequency access patterns (many requests per cached hour) or contexts where the read discount is large in absolute terms.
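A small helper (prices from the table above) shows how many cache reads per hour are needed just to cover the storage bill:

```python
def gemini_breakeven_calls_per_hour(
    cached_tokens: int,
    storage_per_mtok_hour: float = 4.50,  # Gemini 1.5 Pro
    normal_per_mtok: float = 1.25,
    read_per_mtok: float = 0.3125,
) -> float:
    """Cache reads per hour needed for read savings to offset storage."""
    storage_per_hour = cached_tokens / 1e6 * storage_per_mtok_hour
    saving_per_call = cached_tokens / 1e6 * (normal_per_mtok - read_per_mtok)
    return storage_per_hour / saving_per_call

# Note the cache size cancels out: the threshold is a property of the rates.
print(round(gemini_breakeven_calls_per_hour(100_000), 1))  # 4.8
```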
Provider comparison for a 50K-token system prompt called 100 times per day (assuming the cache stays warm between calls; the Gemini figure assumes 8 hours of cache storage per day):

| Provider | Daily cost (no cache) | Daily cost (with cache) | Savings |
|---|---|---|---|
| Anthropic Sonnet 3.5 | $15.00 | $1.67 | 88.9% |
| OpenAI GPT-4o | $12.50 | $6.31 | 49.5% |
| Gemini 1.5 Pro | $6.25 | $3.43 | 45.2% |
Anthropic's deeper discount wins decisively for high-frequency, large-context workloads. OpenAI's friction-free model wins when the caching benefit is marginal and operational simplicity matters more.
Break-Even Analysis: When Does Caching Actually Save Money?
For caching to be net positive, the accumulated savings from cache reads must exceed the write premium paid on the first call. Let's derive the break-even formula.
Let:
- C_normal = normal cost per token
- C_write = cache write cost per token
- C_read = cache read cost per token
- N = number of times the prompt is called
- T = number of tokens in the cached prefix
Total cost without caching for N calls:
Cost_no_cache = N × T × C_normal

Total cost with caching (1 write + N-1 reads):

Cost_with_cache = T × C_write + (N-1) × T × C_read

Break-even occurs when Cost_with_cache = Cost_no_cache:
T × C_write + (N-1) × T × C_read = N × T × C_normal
C_write + (N-1) × C_read = N × C_normal
N × (C_normal - C_read) = C_write - C_read
N_breakeven = (C_write - C_read) / (C_normal - C_read)

For Anthropic Claude Sonnet 3.5:

N_breakeven = ($3.75 - $0.30) / ($3.00 - $0.30)
= $3.45 / $2.70
= 1.28

Caching pays off after 1.28 calls. The first call always costs extra (write premium); the second call already produces net savings. For anything called twice or more, Anthropic prompt caching is economically positive.
For OpenAI GPT-4o:
N_breakeven = ($2.50 - $1.25) / ($2.50 - $1.25)
= 1.0

No write premium means break-even is exactly 1 call. Every cache hit, no matter how infrequent, saves money.
For Gemini 1.5 Pro (with hourly storage):
Gemini requires a different analysis because of the per-hour storage cost. For a 100K-token cache stored for H hours with N total calls (ignoring the one-time cache-creation charge, which is billed at the normal input rate):
Cost_with_cache = (100K × $4.50/MTok/hour × H) + (N × 100K × $0.3125/MTok)
= 0.45H + 0.03125N
Cost_no_cache = N × 100K × $1.25/MTok
= 0.125N
Break-even: 0.45H + 0.03125N = 0.125N
0.45H = 0.09375N
N_breakeven = 4.8 × H

For a 1-hour Gemini cache, you need at least 5 calls to break even. For a full 24-hour cache, you need 115 calls. Gemini caching only makes sense for hot prompts read well above roughly five times per cached hour.
Here is a Python utility that calculates exact costs and break-even for any configuration:
from dataclasses import dataclass
from typing import Optional
@dataclass
class CachingPricingModel:
"""Pricing model for a single LLM provider's caching tier."""
provider: str
model: str
normal_input_per_mtok: float # $/MTok for uncached input
cache_write_per_mtok: float # $/MTok for cache write (0 if no premium)
cache_read_per_mtok: float # $/MTok for cache hit
output_per_mtok: float # $/MTok for output tokens
storage_per_mtok_per_hour: float # $/MTok/hour for stored cache (0 if not applicable)
min_cache_tokens: int # Minimum tokens before caching applies
ttl_minutes: int # Cache TTL in minutes
PRICING = {
"claude-sonnet-3-5": CachingPricingModel(
provider="Anthropic",
model="claude-sonnet-3-5",
normal_input_per_mtok=3.00,
cache_write_per_mtok=3.75,
cache_read_per_mtok=0.30,
output_per_mtok=15.00,
storage_per_mtok_per_hour=0.0,
min_cache_tokens=1024,
ttl_minutes=5,
),
"claude-haiku-3": CachingPricingModel(
provider="Anthropic",
model="claude-haiku-3",
normal_input_per_mtok=0.25,
cache_write_per_mtok=0.30,
cache_read_per_mtok=0.03,
output_per_mtok=1.25,
storage_per_mtok_per_hour=0.0,
min_cache_tokens=2048,
ttl_minutes=5,
),
"gpt-4o": CachingPricingModel(
provider="OpenAI",
model="gpt-4o",
normal_input_per_mtok=2.50,
cache_write_per_mtok=2.50, # No write premium
cache_read_per_mtok=1.25,
output_per_mtok=10.00,
storage_per_mtok_per_hour=0.0,
min_cache_tokens=1024,
ttl_minutes=-1, # Managed by OpenAI
),
"gemini-1-5-pro": CachingPricingModel(
provider="Google",
model="gemini-1-5-pro",
normal_input_per_mtok=1.25,
cache_write_per_mtok=1.25, # No write premium
cache_read_per_mtok=0.3125,
output_per_mtok=5.00,
storage_per_mtok_per_hour=4.50,
min_cache_tokens=32768,
ttl_minutes=60,
),
}
def analyze_caching_economics(
pricing: CachingPricingModel,
cached_tokens: int,
uncached_tokens: int,
output_tokens: int,
num_calls: int,
storage_hours: float = 1.0,
) -> dict:
"""
Calculate costs with and without caching, including break-even point.
Args:
pricing: Provider pricing model
cached_tokens: Tokens in the stable cached prefix
uncached_tokens: Tokens that vary per call (not cached)
output_tokens: Average output tokens per call
num_calls: Number of API calls to analyze
storage_hours: Hours the cache is stored (for Gemini)
Returns:
Dictionary with cost breakdown and break-even analysis
"""
if cached_tokens < pricing.min_cache_tokens:
return {
"eligible": False,
"reason": f"Prompt too short: {cached_tokens} < {pricing.min_cache_tokens} token minimum"
}
mtok = 1_000_000
# Cost without caching
cost_no_cache = (
num_calls * (cached_tokens + uncached_tokens) * pricing.normal_input_per_mtok / mtok
+ num_calls * output_tokens * pricing.output_per_mtok / mtok
)
# Cost with caching: 1 write + (N-1) reads for cached portion
cache_storage_cost = (
cached_tokens * pricing.storage_per_mtok_per_hour / mtok * storage_hours
)
cost_with_cache = (
# Cache write (first call)
cached_tokens * pricing.cache_write_per_mtok / mtok
# Uncached tokens on all calls at normal price
+ num_calls * uncached_tokens * pricing.normal_input_per_mtok / mtok
# Cache reads on subsequent calls
+ max(0, num_calls - 1) * cached_tokens * pricing.cache_read_per_mtok / mtok
# Output tokens (same either way)
+ num_calls * output_tokens * pricing.output_per_mtok / mtok
# Storage cost (Gemini-specific)
+ cache_storage_cost
)
savings = cost_no_cache - cost_with_cache
savings_pct = (savings / cost_no_cache) * 100 if cost_no_cache > 0 else 0
# Break-even calculation
if pricing.cache_write_per_mtok != pricing.cache_read_per_mtok:
write_premium = pricing.cache_write_per_mtok - pricing.normal_input_per_mtok
read_savings = pricing.normal_input_per_mtok - pricing.cache_read_per_mtok
if read_savings > 0:
breakeven_calls = (
(write_premium * cached_tokens / mtok + cache_storage_cost)
/ (read_savings * cached_tokens / mtok)
+ 1
)
else:
breakeven_calls = float('inf')
else:
# No write premium (OpenAI), every call benefits
breakeven_calls = 1.0
return {
"eligible": True,
"provider": pricing.provider,
"model": pricing.model,
"num_calls": num_calls,
"cached_tokens": cached_tokens,
"cost_no_cache": round(cost_no_cache, 4),
"cost_with_cache": round(cost_with_cache, 4),
"savings_dollars": round(savings, 4),
"savings_percent": round(savings_pct, 1),
"breakeven_calls": round(breakeven_calls, 2),
"cache_storage_cost": round(cache_storage_cost, 4),
}
# Example: 50K system prompt, 500 calls per day
for model_key, pricing in PRICING.items():
result = analyze_caching_economics(
pricing=pricing,
cached_tokens=50_000,
uncached_tokens=500,
output_tokens=1_000,
num_calls=500,
storage_hours=24.0,
)
if result["eligible"]:
print(f"\n{result['provider']} {result['model']}:")
print(f" No cache: ${result['cost_no_cache']:.2f}/day")
print(f" With cache: ${result['cost_with_cache']:.2f}/day")
print(f" Savings: ${result['savings_dollars']:.2f} ({result['savings_percent']}%)")
print(f" Break-even at: {result['breakeven_calls']} calls")

Sample output for 50K cached tokens, 500 calls/day (uncached input, output tokens, and 24 hours of Gemini storage included):

Anthropic claude-sonnet-3-5:
No cache: $83.25/day
With cache: $15.92/day
Savings: $67.33 (80.9%)
Break-even at: 1.28 calls

Anthropic claude-haiku-3:
No cache: $6.94/day
With cache: $1.45/day
Savings: $5.49 (79.1%)
Break-even at: 1.23 calls

OpenAI gpt-4o:
No cache: $68.12/day
With cache: $36.94/day
Savings: $31.19 (45.8%)
Break-even at: 1.0 calls

Google gemini-1-5-pro:
No cache: $34.06/day
With cache: $16.07/day
Savings: $17.99 (52.8%)
Break-even at: 116.2 calls

Cache-Optimal Prompt Architecture
Cache-optimal prompts are organized on a single axis: stability. Content that never changes goes first; content that changes per request goes last. This is the opposite of how many developers write prompts, which often start with a brief system message and end with a long dynamic context.
The stability hierarchy:
- Role and persona definition (e.g., "You are an expert backend engineer..."): Almost never changes.
- Task instructions and output format: Changes only during prompt iteration cycles.
- Tool/function schemas: Changes only when you add new tools.
- Few-shot examples: Changes rarely; should be sorted by stability.
- Reference documents (for RAG): Changes per document, not per query.
- Conversation history: Changes per turn. Cache older turns, leave recent turns uncached.
- Current user query: Always different. Never cacheable.
Here's a concrete example of a poor vs. cache-optimal structure for a customer support agent:
Poor architecture (cache hits rare):
# BAD: Dynamic content buried inside stable content
def build_prompt_bad(user_query: str, customer_history: str) -> list:
return [
{
"role": "user",
"content": f"""
You are a helpful support agent for Acme Corp.
Customer query: {user_query}
Customer history:
{customer_history}
Tools available:
- lookup_order(order_id: str) -> dict
- issue_refund(order_id: str, amount: float) -> bool
- escalate_ticket(ticket_id: str, reason: str) -> str
Always be polite and solution-focused.
JSON output format: {{"action": "...", "response": "..."}}
"""
}
]
# Problem: user_query appears at position 130 chars in, so everything after it
# (customer_history, tool definitions) cannot be cachedCache-optimal architecture:
# GOOD: Stable content at top, dynamic content at bottom
def build_prompt_optimal(
user_query: str,
customer_history: str,
model_client,
) -> dict:
"""
Structure: system (stable) → user (semi-stable history + fresh query)
Cache boundary: after tool definitions, before conversation history
"""
# Layer 1: Stable system instructions + tools (mark for caching)
stable_system = """You are a helpful support agent for Acme Corp.
Always be polite and solution-focused. Resolve issues on first contact.
## Available Tools
### lookup_order
Look up details for an order by ID.
Parameters: order_id (string, required)
Returns: {order_id, status, items, total, created_at}
### issue_refund
Process a refund for an order.
Parameters: order_id (string, required), amount (float, required)
Returns: {success: bool, refund_id: string}
### escalate_ticket
Escalate to human support.
Parameters: ticket_id (string, required), reason (string, required)
Returns: {escalation_id: string, eta_hours: int}
## Output Format
Always respond with valid JSON:
{"action": "tool_call|direct_response", "tool": "...", "parameters": {...}, "response": "..."}"""
# Layer 2: Semi-stable (cache with conversation)
customer_context = f"""## Customer Account History
{customer_history}
## Current Session Start"""
# Layer 3: Dynamic (never cached)
current_query = f"\n\nCustomer: {user_query}"
# Anthropic: use cache_control breakpoints
return {
"system": [
{
"type": "text",
"text": stable_system,
"cache_control": {"type": "ephemeral"}, # Cache layer 1
}
],
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": customer_context,
"cache_control": {"type": "ephemeral"}, # Cache layer 2
},
{
"type": "text",
"text": current_query,
# No cache_control → not cached (changes every call)
}
]
}
]
}

Conversation history caching: Multi-turn conversations create a natural caching opportunity. The first N turns are stable and can be cached; only the latest turn is new. The optimal pattern is to rebuild the cache boundary after every few turns:
def build_conversation_with_caching(
system_prompt: str,
conversation_history: list[dict],
new_user_message: str,
cache_turns_threshold: int = 6,
) -> dict:
"""
Cache the first N turns, leave the most recent turn uncached.
Rebuild the cache checkpoint when conversation grows.
"""
messages = []
# Add conversation history with cache on the last stable checkpoint
history_len = len(conversation_history)
cache_up_to = max(0, history_len - 2) # Cache all but last 2 turns
for i, turn in enumerate(conversation_history):
is_last_stable = (i == cache_up_to - 1)
if is_last_stable and history_len >= cache_turns_threshold:
# Mark this as the cache checkpoint
messages.append({
"role": turn["role"],
"content": [
{
"type": "text",
"text": turn["content"],
"cache_control": {"type": "ephemeral"},
}
]
})
else:
messages.append(turn)
# Add new user message (uncached)
messages.append({"role": "user", "content": new_user_message})
return {
"system": [
{
"type": "text",
"text": system_prompt,
"cache_control": {"type": "ephemeral"},
}
],
"messages": messages
}

The golden rule of cache architecture: Every character before the cache boundary must be identical across requests. Even a timestamp, request ID, or user name injected into the system prompt will break the cache for everything after it. Move variable identifiers into the first uncached message.
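A minimal sketch of the fix, with hypothetical Acme names: per-request metadata rides in the first uncached block, never in the cached system prompt.

```python
from datetime import datetime, timezone

# Stable, byte-identical across every request -> cacheable.
SYSTEM_PROMPT = "You are a support agent for Acme Corp. Be polite and solution-focused."

def build_request(user_query: str, user_id: str) -> dict:
    # BAD would be a system text containing f"Request time: {now}" --
    # a fresh timestamp changes the prefix and misses the cache every call.
    # GOOD: per-request identifiers go into the uncached user message.
    metadata = (
        f"Request time: {datetime.now(timezone.utc).isoformat()}\n"
        f"User ID: {user_id}\n\n"
    )
    return {
        "system": [{
            "type": "text",
            "text": SYSTEM_PROMPT,  # identical bytes on every call
            "cache_control": {"type": "ephemeral"},
        }],
        "messages": [{"role": "user", "content": metadata + user_query}],
    }
```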
Implementation: Anthropic, OpenAI, and Gemini Code Patterns
Each provider's caching API has different surface area. Here are production-ready patterns for each.
Anthropic: Explicit Cache Control
import anthropic
from typing import Any
client = anthropic.Anthropic()
def query_with_cache(
large_document: str,
question: str,
model: str = "claude-opus-4-6",
) -> dict[str, Any]:
"""
Document Q&A with prompt caching.
The document is cached; only the question varies per call.
"""
response = client.messages.create(
model=model,
max_tokens=2048,
system=[
{
"type": "text",
"text": (
"You are a precise document analyst. "
"Answer questions based only on the provided document. "
"If the answer is not in the document, say 'Not found in document.'"
),
"cache_control": {"type": "ephemeral"},
},
{
"type": "text",
"text": f"<document>\n{large_document}\n</document>",
"cache_control": {"type": "ephemeral"},
},
],
messages=[
{
"role": "user",
"content": question,
# No cache_control: varies per request
}
],
)
usage = response.usage
return {
"answer": response.content[0].text,
"cache_creation_tokens": usage.cache_creation_input_tokens,
"cache_read_tokens": usage.cache_read_input_tokens,
"uncached_input_tokens": usage.input_tokens,
"output_tokens": usage.output_tokens,
"cache_hit": usage.cache_read_input_tokens > 0,
}
# Batch questions against same document
questions = [
"What are the payment terms?",
"Who are the authorized signatories?",
"What is the termination clause?",
"What are the liability caps?",
]
for i, question in enumerate(questions):
result = query_with_cache(large_document=contract_text, question=question)
status = "CACHE HIT" if result["cache_hit"] else "CACHE MISS"
print(f"Q{i+1} [{status}]: read={result['cache_read_tokens']:,} tokens")
# Output:
# Q1 [CACHE MISS]: read=0 tokens ← first call writes cache
# Q2 [CACHE HIT]: read=45,230 tokens ← subsequent calls read cache
# Q3 [CACHE HIT]: read=45,230 tokens
# Q4 [CACHE HIT]: read=45,230 tokens

OpenAI: Automatic Caching with Monitoring
from openai import OpenAI
from typing import Any
client = OpenAI()
# CRITICAL: The system prompt must be byte-for-byte identical across requests.
# Load it once; do not format or modify it at call time.
SYSTEM_PROMPT = open("system_prompt.txt").read() # Load once at startup
def query_openai_with_cache_monitoring(
user_message: str,
model: str = "gpt-4o",
) -> dict[str, Any]:
"""
OpenAI caches automatically. We just need to ensure the system prompt
is identical across requests (same object reference or string value).
"""
response = client.chat.completions.create(
model=model,
messages=[
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": user_message},
],
max_tokens=1024,
)
usage = response.usage
cached_tokens = getattr(
getattr(usage, "prompt_tokens_details", None),
"cached_tokens",
0
)
return {
"content": response.choices[0].message.content,
"total_input_tokens": usage.prompt_tokens,
"cached_tokens": cached_tokens,
"uncached_tokens": usage.prompt_tokens - cached_tokens,
"output_tokens": usage.completion_tokens,
"cache_ratio": cached_tokens / max(usage.prompt_tokens, 1),
}

Google Gemini: Explicit Context Cache Objects
import google.generativeai as genai
from google.generativeai import caching
import datetime
genai.configure(api_key="YOUR_API_KEY")
def create_gemini_cache(
large_content: str,
ttl_minutes: int = 60,
model: str = "models/gemini-1.5-pro-001",
) -> str:
"""
Create a named Gemini context cache. Returns the cache name for reuse.
Minimum 32,768 tokens required.
"""
cache = caching.CachedContent.create(
model=model,
contents=[
{
"role": "user",
"parts": [{"text": large_content}]
}
],
system_instruction=(
"You are an expert analyst. Answer questions accurately "
"and cite specific sections from the provided content."
),
ttl=datetime.timedelta(minutes=ttl_minutes),
)
print(f"Cache created: {cache.name}")
print(f"Token count: {cache.usage_metadata.total_token_count:,}")
return cache.name
def query_with_gemini_cache(cache_name: str, question: str) -> str:
"""Query using an existing Gemini context cache."""
cache = caching.CachedContent.get(cache_name)
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
response = model.generate_content(question)
return response.text
# Usage pattern
cache_name = create_gemini_cache(large_content=corpus_text) # One-time setup
answers = [query_with_gemini_cache(cache_name, q) for q in questions]

Prompt Caching in RAG Pipelines
RAG pipelines have two distinct caching opportunities: caching the document chunks used as context, and caching the preprocessing step that generates those chunks.
Caching retrieved context. When the same document chunks are retrieved for multiple similar queries, those chunks can be cached. This is particularly effective for FAQ systems, legal document analysis, and product documentation Q&A where a small set of chunks covers most queries.
import hashlib
from anthropic import Anthropic
client = Anthropic()
class CachedRAGPipeline:
"""
RAG pipeline that caches retrieved context to reduce input token costs.
Strategy: if the same chunks are retrieved for multiple queries (common
in domain-specific corpora), the second query onward hits the cache.
"""
def __init__(self, retriever, model: str = "claude-sonnet-3-5"):
self.retriever = retriever
self.model = model
self._last_chunk_hash = None
self._last_chunks = None
def _hash_chunks(self, chunks: list[str]) -> str:
content = "\n---\n".join(sorted(chunks))
return hashlib.sha256(content.encode()).hexdigest()[:16]
def query(self, user_question: str, top_k: int = 5) -> dict:
chunks = self.retriever.retrieve(user_question, top_k=top_k)
chunk_hash = self._hash_chunks(chunks)
context_text = "\n\n".join([
f"[Document {i+1}]\n{chunk}"
for i, chunk in enumerate(chunks)
])
# Build messages with cache control on the context
response = client.messages.create(
model=self.model,
max_tokens=1024,
system=[
{
"type": "text",
"text": (
"You are a helpful assistant. Answer questions based on "
"the provided context. If the context doesn't contain "
"the answer, say so clearly."
),
"cache_control": {"type": "ephemeral"},
},
{
"type": "text",
"text": f"<context>\n{context_text}\n</context>",
"cache_control": {"type": "ephemeral"},
},
],
messages=[{"role": "user", "content": user_question}],
)
cache_hit = response.usage.cache_read_input_tokens > 0
same_chunks = chunk_hash == self._last_chunk_hash
self._last_chunk_hash = chunk_hash
self._last_chunks = chunks
return {
"answer": response.content[0].text,
"cache_hit": cache_hit,
"same_chunks_as_last": same_chunks,
"context_tokens": (
response.usage.cache_read_input_tokens
or response.usage.cache_creation_input_tokens
),
}

Caching contextual retrieval preprocessing. When using contextual retrieval (generating 50-100 word context summaries per chunk with Claude Haiku), prompt caching on the document portion reduces the preprocessing cost dramatically. According to Anthropic's published data, this drops contextual retrieval cost from $3.60 per million tokens to $1.02 per million tokens, a 72% reduction.
def generate_chunk_contexts_with_caching(
document: str,
chunks: list[str],
model: str = "claude-haiku-3",
) -> list[str]:
"""
Generate contextual summaries for each chunk.
Cache the full document (stable across all chunks); only the
specific chunk content changes per request.
Cost: ~$1.02/MTok total (vs $3.60 without caching)
because document is cached after first chunk, read for all subsequent.
"""
contexts = []
for chunk in chunks:
response = client.messages.create(
model=model,
max_tokens=150,
system=[
{
"type": "text",
"text": (
"Generate a brief 50-75 word context that situates the "
"given chunk within the overall document. Focus on what "
"the chunk is about and why it matters in context."
),
"cache_control": {"type": "ephemeral"},
},
{
"type": "text",
"text": f"<full_document>\n{document}\n</full_document>",
"cache_control": {"type": "ephemeral"},
# This is the expensive part: cached after first chunk
},
],
messages=[
{
"role": "user",
"content": (
f"Generate context for this chunk:\n\n"
f"<chunk>\n{chunk}\n</chunk>"
)
}
],
)
contexts.append(response.content[0].text)
return contexts

The key insight: a 100-page document preprocessed into 200 chunks sends the full document 200 times without caching. With caching, the document is sent once (cache write at a 20% premium) and read 199 times at an 88% discount. For a 50K-token document at Claude 3 Haiku pricing, that is $2.50 without caching versus roughly $0.31 with it, saving about $2.19 per document, a saving that compounds across every document in the corpus.
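The arithmetic can be checked directly with the Claude 3 Haiku prices from the earlier table:

```python
doc_tokens = 50_000
num_chunks = 200
normal, write, read = 0.25, 0.30, 0.03  # $/MTok, Claude 3 Haiku

# Without caching: the full document is billed at the normal rate per chunk.
no_cache = num_chunks * doc_tokens * normal / 1e6

# With caching: one write, then (num_chunks - 1) discounted reads.
with_cache = (doc_tokens * write + (num_chunks - 1) * doc_tokens * read) / 1e6

print(f"no cache: ${no_cache:.2f}, with cache: ${with_cache:.2f}")
# no cache: $2.50, with cache: $0.31
```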
Agent System Caching Patterns
Agent systems benefit from prompt caching more than almost any other use case because they combine large, stable system prompts with many sequential calls. A typical ReAct agent loop sends the same system prompt, tool definitions, and growing conversation history on every iteration. Without caching, a 10-step agent loop with a 40K-token system prompt pays for 400K tokens of system-prompt prefill alone.
Agent loop caching strategy:
import anthropic
import json
from typing import Callable
client = anthropic.Anthropic()
TOOLS = [
{
"name": "search_web",
"description": "Search the web for current information on any topic.",
"input_schema": {
"type": "object",
"properties": {
"query": {"type": "string", "description": "Search query"},
"num_results": {"type": "integer", "default": 5}
},
"required": ["query"]
}
},
{
"name": "execute_code",
"description": "Execute Python code and return the output.",
"input_schema": {
"type": "object",
"properties": {
"code": {"type": "string"},
"timeout_seconds": {"type": "integer", "default": 30}
},
"required": ["code"]
}
},
# ... more tools
]
# Pre-serialize tools once: same bytes on every call
TOOLS_JSON = json.dumps(TOOLS, indent=2)
AGENT_SYSTEM_PROMPT = f"""You are a precise research and analysis agent.
## Capabilities
You have access to web search and code execution tools.
Use tools judiciously; only when needed to answer the question accurately.
## Reasoning Pattern
1. Analyze the user's request
2. Identify what information or computation is needed
3. Use tools to gather/compute
4. Synthesize results into a clear answer
## Tools Available
{TOOLS_JSON}
## Output Format
Think step by step. Use <thinking> tags for reasoning, then give your final answer."""
class CachedAgentLoop:
"""
Agent loop that maximizes cache utilization across iterations.
Cache layers:
- Layer 1: System prompt + tool definitions (never changes)
- Layer 2: Conversation history up to current turn (stable portion)
- Uncached: Latest user message and pending tool responses
"""
```python
    def __init__(self, model: str = "claude-opus-4-6"):
        self.model = model
        self.conversation: list[dict] = []
        self.total_cache_writes = 0
        self.total_cache_reads = 0
        self.total_input_tokens = 0

    def _build_messages_with_cache(self, new_user_message: str) -> list[dict]:
        """Add cache control to stable conversation history."""
        messages = []
        # Mark all but the last 2 turns as cacheable
        cache_boundary = max(0, len(self.conversation) - 2)
        for i, msg in enumerate(self.conversation):
            if i == cache_boundary - 1 and len(self.conversation) >= 4:
                # Cache checkpoint on the last stable turn
                if isinstance(msg["content"], str):
                    messages.append({
                        "role": msg["role"],
                        "content": [
                            {
                                "type": "text",
                                "text": msg["content"],
                                "cache_control": {"type": "ephemeral"},
                            }
                        ],
                    })
                else:
                    messages.append(msg)
            else:
                messages.append(msg)
        # New message is never cached
        messages.append({"role": "user", "content": new_user_message})
        return messages

    def step(
        self,
        user_message: str,
        tool_executor: Callable[[str, dict], str],
    ) -> str:
        """Run one complete agent turn with tool calls."""
        messages = self._build_messages_with_cache(user_message)
        while True:
            response = client.messages.create(
                model=self.model,
                max_tokens=4096,
                system=[
                    {
                        "type": "text",
                        "text": AGENT_SYSTEM_PROMPT,
                        "cache_control": {"type": "ephemeral"},
                    }
                ],
                messages=messages,
            )
            # Track cache usage
            self.total_cache_writes += response.usage.cache_creation_input_tokens
            self.total_cache_reads += response.usage.cache_read_input_tokens
            self.total_input_tokens += response.usage.input_tokens
            # Check stop reason
            if response.stop_reason == "end_turn":
                final_text = response.content[-1].text
                # Update conversation history for next turn
                self.conversation.append({"role": "user", "content": user_message})
                self.conversation.append({"role": "assistant", "content": final_text})
                return final_text
            # Handle tool calls
            tool_calls = [b for b in response.content if b.type == "tool_use"]
            tool_results = []
            for tool_call in tool_calls:
                result = tool_executor(tool_call.name, tool_call.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": tool_call.id,
                    "content": result,
                })
            # Append assistant response and tool results
            messages.append({"role": "assistant", "content": response.content})
            messages.append({"role": "user", "content": tool_results})

    def get_cache_stats(self) -> dict:
        total_billable = self.total_cache_writes + self.total_cache_reads + self.total_input_tokens
        return {
            "cache_write_tokens": self.total_cache_writes,
            "cache_read_tokens": self.total_cache_reads,
            "uncached_tokens": self.total_input_tokens,
            "cache_hit_rate": (
                self.total_cache_reads / max(total_billable, 1) * 100
            ),
        }
```

For a 10-step agent loop with a 40K-token system prompt and 500-token average conversation:
- Without caching: 10 × (40,000 + 500) = 405,000 input tokens at $3.00/MTok = $1.22
- With caching: 1 write (40K×$3.75/MTok) + 9 reads (40K×$0.30/MTok) + 10×500 uncached = $0.15 + $0.108 + $0.015 = $0.273
- Savings: 77.6% for a single agent run. At 10,000 agent runs per month, this saves $9,470/month.
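These figures can be reproduced with a small helper. The function below is a sketch using the Sonnet-class rates quoted in this article ($3.00/MTok input, $3.75/MTok cache write, $0.30/MTok cache read); its name and signature are illustrative, not part of any SDK.

```python
def agent_run_cost(steps: int, cached_tokens: int, dynamic_tokens: int,
                   input_rate: float = 3.00, write_rate: float = 3.75,
                   read_rate: float = 0.30) -> dict:
    """Compare input-token cost of an agent run with and without caching.

    Rates are in dollars per million tokens. The first call pays the
    cache write premium; the remaining steps pay the cache-read rate
    on the cached prefix, with dynamic tokens always billed normally.
    """
    per_tok = 1e-6
    uncached = steps * (cached_tokens + dynamic_tokens) * input_rate * per_tok
    cached = (
        cached_tokens * write_rate * per_tok                   # one cache write
        + (steps - 1) * cached_tokens * read_rate * per_tok    # subsequent reads
        + steps * dynamic_tokens * input_rate * per_tok        # uncached dynamic part
    )
    return {"uncached": uncached, "cached": cached,
            "savings_pct": (1 - cached / uncached) * 100}
```

With the numbers above (10 steps, 40K cached, 500 dynamic), it returns roughly $1.22 uncached versus $0.27 cached.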
## Measuring Cache Efficiency in Production
Cache efficiency in production requires tracking at the call level, not the aggregate. A high average cache hit rate can mask critical patterns. Certain request types might have 0% cache hits due to prompt construction bugs, while others hit 99%.
```python
import time
from dataclasses import dataclass, field
from collections import defaultdict
from typing import Optional

import anthropic


@dataclass
class CacheMetrics:
    """Per-call caching metrics."""
    timestamp: float
    endpoint: str
    cache_write_tokens: int
    cache_read_tokens: int
    uncached_input_tokens: int
    output_tokens: int
    latency_ms: float

    @property
    def total_input_tokens(self) -> int:
        return self.cache_write_tokens + self.cache_read_tokens + self.uncached_input_tokens

    @property
    def cache_hit_rate(self) -> float:
        if self.total_input_tokens == 0:
            return 0.0
        return self.cache_read_tokens / self.total_input_tokens

    def cost_anthropic_sonnet(self) -> float:
        """Calculate actual cost in dollars."""
        return (
            self.cache_write_tokens * 3.75e-6
            + self.cache_read_tokens * 0.30e-6
            + self.uncached_input_tokens * 3.00e-6
            + self.output_tokens * 15.00e-6
        )

    def counterfactual_cost(self) -> float:
        """Cost if caching were disabled."""
        return (
            self.total_input_tokens * 3.00e-6
            + self.output_tokens * 15.00e-6
        )


class CacheMonitor:
    """Production cache efficiency monitor."""

    def __init__(self):
        self._metrics: list[CacheMetrics] = []
        self._by_endpoint: dict[str, list[CacheMetrics]] = defaultdict(list)

    def record(self, metrics: CacheMetrics) -> None:
        self._metrics.append(metrics)
        self._by_endpoint[metrics.endpoint].append(metrics)

    def summary(self, endpoint: Optional[str] = None) -> dict:
        data = self._by_endpoint[endpoint] if endpoint else self._metrics
        if not data:
            return {}
        total_calls = len(data)
        total_cost = sum(m.cost_anthropic_sonnet() for m in data)
        counterfactual = sum(m.counterfactual_cost() for m in data)
        avg_hit_rate = sum(m.cache_hit_rate for m in data) / total_calls
        return {
            "total_calls": total_calls,
            "avg_cache_hit_rate": f"{avg_hit_rate * 100:.1f}%",
            "total_cost": f"${total_cost:.4f}",
            "counterfactual_cost": f"${counterfactual:.4f}",
            "savings": f"${counterfactual - total_cost:.4f}",
            "savings_pct": f"{(1 - total_cost/counterfactual) * 100:.1f}%" if counterfactual > 0 else "N/A",
            "zero_hit_calls": sum(1 for m in data if m.cache_hit_rate == 0),
        }

    def alert_low_cache_rate(self, threshold: float = 0.5) -> list[str]:
        """Return endpoint names with suspiciously low cache hit rates."""
        alerts = []
        for endpoint, metrics in self._by_endpoint.items():
            if len(metrics) < 10:
                continue
            avg_hit = sum(m.cache_hit_rate for m in metrics[-50:]) / min(len(metrics), 50)
            if avg_hit < threshold:
                alerts.append(
                    f"LOW CACHE HIT: {endpoint} averaging {avg_hit*100:.1f}% "
                    f"(threshold: {threshold*100:.0f}%)"
                )
        return alerts
```

Key metrics to track in production:
- Cache hit rate by endpoint: sudden drops indicate prompt construction changes that broke the cache prefix
- Cold start rate: percentage of calls that triggered cache writes; high rates suggest TTL expiry or cache misses
- Cost per 1K calls by endpoint: the most actionable metric; reveals which endpoints have optimization headroom
- TTFT (time to first token): cached calls should be 60-80% faster; deviations suggest cache misses
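To surface the masking pattern described earlier (a healthy average hiding a zero-hit endpoint), hit rates can also be aggregated directly from raw per-call usage records, independent of the `CacheMonitor` class. The dict keys below mirror Anthropic's usage fields; the helper name is illustrative.

```python
from collections import defaultdict

def hit_rate_by_endpoint(calls: list[dict]) -> dict[str, float]:
    """Cache hit rate per endpoint from raw per-call usage records.

    Each record is assumed to carry: endpoint, cache_read_input_tokens,
    cache_creation_input_tokens, and input_tokens (uncached input).
    """
    reads: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for c in calls:
        billable = (c["cache_read_input_tokens"]
                    + c["cache_creation_input_tokens"]
                    + c["input_tokens"])
        reads[c["endpoint"]] += c["cache_read_input_tokens"]
        totals[c["endpoint"]] += billable
    return {ep: reads[ep] / totals[ep] for ep in totals if totals[ep] > 0}
```

An endpoint sitting at 0.0 in this breakdown is the prompt construction bug to chase, regardless of how healthy the global average looks.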
Common diagnostic patterns:
| Observation | Likely Cause | Fix |
|---|---|---|
| 0% cache hit on all calls | Dynamic content in stable layer | Move variable content after cache breakpoint |
| Cache hit rate drops after deploy | Prompt template changed | Check for whitespace/formatting changes |
| Cache hits but no latency reduction | Cache hit on small prefix only | Move cache boundary earlier in prompt |
| High cache write rate | TTL expiring before reuse | Check call frequency vs 5-minute TTL |
| Gemini cache not helping | <32K tokens in cache | Consolidate more content into cache |
## Common Mistakes and Cache Anti-Patterns
Anti-pattern 1: Injecting dynamic content into the system prompt.
```python
# BAD: Timestamp breaks the cache for everything after it
system = f"""
You are a helpful assistant.
Current time: {datetime.now().isoformat()}  # ← Cache breaker!
User ID: {user_id}  # ← Cache breaker!
[40,000 tokens of instructions and examples follow]
"""

# GOOD: Stable system prompt, dynamic context in first user message
system = """
You are a helpful assistant.
[40,000 tokens of instructions and examples]
"""
messages = [
    {
        "role": "user",
        "content": f"[Context: {datetime.now().strftime('%Y-%m-%d')}, User: {user_id}]\n\n{actual_question}",
    }
]
```

Anti-pattern 2: Sorting tool schemas dynamically.
Tool schemas are large (5K-20K tokens). If your code rebuilds tool definitions per request (filtering to only "relevant" tools, sorting by category, or fetching from a database), the serialized JSON will differ between calls and the cache will never hit.
```python
# BAD: Dynamic tool filtering
tools = [t for t in ALL_TOOLS if t["name"] in user_permissions[user_id]]
# Different users → different tool lists → no cache hits

# GOOD: Cache the full schema; handle authorization in tool execution
tools = ALL_TOOLS  # Always the same bytes
# In tool executor: check permissions before running the tool
```

Anti-pattern 3: Template formatting that introduces whitespace variation.
```python
# BAD: f-string indentation that varies with content
prompt = f"""
    System: {system_text}
    Examples: {examples_text}
    Instructions: {instructions}
"""
# Leading spaces vary if text wraps, causing cache miss

# GOOD: No formatting ambiguity
prompt = f"System: {system_text}\nExamples: {examples_text}\nInstructions: {instructions}"
```

Anti-pattern 4: Ignoring cache TTL in batch workloads.
Anthropic's 5-minute TTL means a batch job that processes records at 1 per 10 minutes will always have a cache miss. Cache TTLs reset on each hit, so the solution is either to process records in bursts (many per minute) or to use a tier with longer TTLs.
```python
# BAD: Slow batch with gap > TTL
for record in records:  # 10 min/record
    process(record)     # Always cache miss
    time.sleep(600)

# GOOD: Burst processing to keep cache warm
BATCH_SIZE = 50
for i in range(0, len(records), BATCH_SIZE):
    batch = records[i:i + BATCH_SIZE]
    results = [process(r) for r in batch]  # 50 calls in <5 min → cache stays warm
    store_results(results)
    # Optional: small pause between batches if needed
```

Anti-pattern 5: Caching too aggressively in multi-tenant systems.
Anthropic's caching is scoped to your API key, not per-user. If you use the same API key for multiple tenants and one tenant's data appears in a cached prompt, another tenant may read from that cache. Always ensure cached content is either generic (system instructions, tool schemas) or tenant-scoped (use per-tenant API keys or ensure cache boundaries never cross tenant data).
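One way to respect that boundary is to make the request builder itself enforce it: only generic constants sit before the cache breakpoint, and all tenant data lands in the uncached suffix. A sketch under those assumptions (`GENERIC_SYSTEM_PROMPT` and the function name are illustrative):

```python
# Illustrative constant: shared, tenant-agnostic content only.
GENERIC_SYSTEM_PROMPT = "You are a support assistant. [long shared instructions]"

def build_request(tenant_context: str, question: str) -> dict:
    """Assemble a request whose cacheable prefix contains no tenant data.

    The cache breakpoint sits on the shared system block, so
    tenant_context never enters the cached prefix and cannot be
    served from another tenant's cache entry.
    """
    return {
        "system": [{
            "type": "text",
            "text": GENERIC_SYSTEM_PROMPT,           # shared, cacheable
            "cache_control": {"type": "ephemeral"},  # breakpoint here
        }],
        "messages": [{
            "role": "user",
            # Tenant-specific data lives after the breakpoint: uncached,
            # so cross-tenant cache reads are structurally impossible.
            "content": f"[Tenant context]\n{tenant_context}\n\n{question}",
        }],
    }
```

A code-review rule that only module-level constants may appear before a `cache_control` block makes this property easy to audit.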
When caching does NOT help:
- Single-use queries with unique large contexts (one-off document summaries)
- Prompts under the minimum token threshold (1K for Anthropic/OpenAI, 32K for Gemini)
- Workloads where call frequency is below the break-even threshold
- Streaming responses with very short prompts where TTFT is already <100ms
- Prompts where every field is per-user (user profile, personalized instructions)
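The first three exclusions can be folded into a single guard. The sketch below combines the provider minimum with the break-even call count derived from the pricing; the defaults assume the Sonnet-class rates used throughout this article, and the function name is illustrative.

```python
def should_cache(prompt_tokens: int, expected_reuses: int,
                 min_tokens: int = 1024,
                 write_rate: float = 3.75, read_rate: float = 0.30,
                 input_rate: float = 3.00) -> bool:
    """Decide whether a stable prefix is worth marking for caching.

    Two gates from the list above: the provider's minimum cacheable
    size, and the break-even call count implied by the pricing
    (rates in $/MTok; defaults assume Claude Sonnet-class pricing).
    """
    if prompt_tokens < min_tokens:
        return False  # below the provider minimum; cache_control is wasted
    # Break-even: write + (n-1)*read < n*input  →  n > (write - read) / (input - read)
    break_even = (write_rate - read_rate) / (input_rate - read_rate)
    return expected_reuses + 1 > break_even  # total calls = 1 write + reuses
```

With the default rates the break-even is 1.28 calls, so any prefix expected to be reused even once within the TTL clears the bar.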
## Key Takeaways

- Prompt caching reuses computed KV cache across API requests. It requires that the cached token prefix be byte-for-byte identical across calls. Any change in the stable prefix (including whitespace or dynamic fields) causes a cache miss and triggers full recomputation.
- Anthropic offers the most aggressive discount: cache reads cost 90% less than normal input tokens, and the break-even is 1.28 calls. After the second call to any Anthropic endpoint with the same prefix, you are saving money.
- OpenAI offers automatic 50% caching with zero friction. No explicit cache control is needed; the API handles it transparently. The discount is smaller than Anthropic's, but there is no risk of cache architecture mistakes.
- Gemini context caching targets very large, high-frequency workloads. The 32K minimum token threshold and per-hour storage cost mean Gemini caching only delivers positive ROI for high-volume workloads with very large documents.
- Cache-optimal prompt architecture puts stable content first. The ordering is: role/persona, then task instructions, then tool schemas, then few-shot examples, then reference documents, then a conversation history checkpoint, then the dynamic current input. Never inject timestamps, user IDs, or per-request data into the stable prefix.
- Agent loop caching is the highest-impact use case. A 10-step agent loop with a 40K-token system prompt saves 77.6% on input costs with caching: the system prompt is read ten times but billed at the discounted cache-read rate after the first write.
- Cache efficiency monitoring requires per-endpoint tracking. A 0% cache hit rate on a specific endpoint indicates a prompt construction bug (dynamic content in the stable layer). Alert on sudden drops in cache hit rate as deployment validation.
- The contextual retrieval + prompt caching combination reduces preprocessing costs from $3.60 to $1.02 per million tokens: the full-document portion of the preprocessing prompt is cached across all chunk context generations.
## FAQ
### How does prompt caching work technically?
Prompt caching reuses the key-value (KV) cache from the transformer's attention computation. During the prefill pass, each input token generates key and value tensors that are stored to avoid recomputation during generation. With caching enabled, providers store this KV cache on their servers after the first request. Subsequent requests that begin with the same token sequence skip directly to the first novel token, paying compute only for the new portion. The prefix must be byte-for-byte identical; a single different character invalidates all subsequent cached entries.
### What is the break-even point for Anthropic prompt caching?
For Claude Sonnet 3.5, the break-even is 1.28 calls. The second call to any endpoint that caches a 1,024+ token prefix already returns a net cost saving. The formula is: (cache_write_cost - cache_read_cost) / (normal_cost - cache_read_cost) = ($3.75 - $0.30) / ($3.00 - $0.30) = 1.28. For most production workloads where the same system prompt is called hundreds or thousands of times daily, caching reduces costs by 85-90%.
### Why does my prompt caching have a 0% hit rate?
A 0% cache hit rate almost always means dynamic content is being injected into the stable prefix. Common causes include: timestamps or request IDs in the system prompt, per-user identifiers before the cache breakpoint, tool schemas that are filtered or sorted dynamically per request, or f-string formatting that introduces whitespace variations. Check that the byte sequence of everything before the cache_control boundary is identical across requests. A single character difference (including trailing spaces or newline style) will cause a cache miss.
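The byte-for-byte check this answer describes can also be enforced mechanically: pin a hash of the rendered stable prefix in a test suite so formatting drift is caught at deploy time rather than in production billing. A minimal sketch (names and the template string are illustrative):

```python
import hashlib

def prefix_fingerprint(stable_prefix: str) -> str:
    """Hash the exact byte sequence that must stay identical for cache hits."""
    return hashlib.sha256(stable_prefix.encode("utf-8")).hexdigest()[:16]

# In a regression test: render the prompt template with fixed inputs and
# compare against a stored fingerprint. Any whitespace or newline drift
# fails the build before it silently zeroes the cache hit rate.
stable = "System: x\nExamples: y\nInstructions: z"
drifted = "System: x \nExamples: y\nInstructions: z"  # one trailing space
assert prefix_fingerprint(stable) != prefix_fingerprint(drifted)
```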
### Does prompt caching work with streaming responses?
Yes. Caching affects the prefill (input processing) phase, not the generation (output streaming) phase. When streaming with a cached prefix, the time-to-first-token (TTFT) is reduced by approximately 79% for a 100K-token cached prompt: from about 11.5s to 2.4s. The token streaming rate after the first token is unaffected by caching. This makes caching valuable for interactive streaming applications with large system prompts.
### Is prompt caching safe in multi-tenant applications?
Prompt caching is scoped to your API key, not per-user. Content in a cached prompt is not shared between different API keys, but within the same API key, any request that begins with the same prefix can hit the cache. This means cached content must be either non-sensitive (system instructions, tool schemas) or you must use separate API keys per tenant. Never include tenant-specific data (user names, account details, personal information) in the cacheable prefix of a shared API key.
### How does Anthropic's 5-minute TTL affect batch workloads?
Anthropic's cache TTL resets to 5 minutes on each cache hit. For caching to remain active across a long batch job, requests must come in faster than once every 5 minutes. If processing is slower than that, the cache expires and every call pays the write premium again. The solution is to process records in bursts (batches of 50-100 records processed rapidly) rather than one per minute. Alternatively, for very infrequent calls, it may be more cost-effective to use model routing (switching to a smaller model) than to rely on caching.
Anthropic. "Prompt Caching." Claude API Documentation, 2025. https://docs.anthropic.com/claude/docs/prompt-caching
OpenAI. "Prompt Caching." OpenAI Platform Documentation, 2025. https://platform.openai.com/docs/guides/prompt-caching
Google. "Context Caching." Gemini API Documentation, 2025. https://ai.google.dev/gemini-api/docs/caching
Anthropic. "Contextual Retrieval." Anthropic Engineering Blog, 2024. https://www.anthropic.com/news/contextual-retrieval
Dao, Tri, et al. "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS 2022. https://arxiv.org/abs/2205.14135
<author-bio>Chait builds EdgeLM, an inference engine optimized for low-latency LLM deployment, and Authos, an autonomous SEO platform. He writes about LLM inference internals, agent architecture, and AI infrastructure. Cost calculations are based on publicly listed API pricing as of Q1 2026.
Written & published by Chaitanya Prabuddha