

Bradley Herman

KV cache is a critical optimization technique that stores previously computed Key and Value tensors during transformer inference, eliminating redundant calculations and reducing generation complexity from O(T³) to O(T²). This translates to 2-24x throughput improvements in production systems. Without KV cache, each new token would require recomputing attention for all previous tokens; with it, only the new token's projections need computation while cached values are reused. Memory consumption scales linearly with sequence length (a 70B model at 128K context needs ~40GB for cache alone), so production deployments combine KV cache with techniques like PagedAttention, quantization, and continuous batching to maximize efficiency.
Watch a 70B model generate text without optimization, and you'll see it sweat. Each new token triggers a cascade—the model recomputes attention for every previous token, again and again and again. Token 100? Recompute 1-99. Token 101? Recompute 1-100. The redundancy is brutal.
Without optimization, the self-attention mechanism requires recomputing key (K) and value (V) vectors for all previous tokens at each generation step, resulting in O(T³) total complexity for generating T tokens. KV cache kills this redundancy dead by stashing previously computed attention values, reducing per-token complexity from O(t²) to O(t) at position t. The payoff? Production systems report throughput improvements up to 5-24x with continuous batching and PagedAttention, while specialized configurations achieve up to 9.9x reductions in Time-to-First-Token latency.
If you're building anything that generates text with transformers, you need to understand this.
KV cache is an optimization technique that stores the Key and Value projection vectors computed during transformer self-attention, eliminating redundant calculations during autoregressive text generation.
The specific problem it solves: every time a transformer generates a new token, the attention mechanism needs access to the key and value vectors for all previous tokens. Without caching, the model recomputes these vectors from scratch at every step. Token 100 requires computing K and V for tokens 1-99. Token 101 requires computing K and V for tokens 1-100. The redundancy is staggering.
KV cache sits at the heart of transformer inference optimization. While techniques like quantization and pruning compress the model itself, KV cache optimizes the generation process. It's not optional for production deployment—it's the baseline that makes everything else possible.
Transformer attention operates through a precise mathematical pipeline. Input embeddings get projected into three matrices: Query (Q), Key (K), and Value (V). The core formula:
Attention(Q,K,V) = softmax(Q·K^T / √d_k) · V
Think of it this way: during generation, the Query vector for the new token attends to all previously cached Key vectors to determine relevance, then retrieves information from the corresponding cached Value vectors. This asymmetry, where a single new Query attends to a growing set of cached Keys and Values, is what makes KV caching pay off.
Here's where things get expensive. At each generation step, the attention formula requires:
Attention(Q_t, K_{1:t}, V_{1:t})
Where K_{1:t} means "all keys from position 1 to position t."
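To make the shapes concrete, here is a minimal PyTorch sketch of a single decode step. It is illustrative only, with random tensors standing in for the real projections; it is not the transformers implementation:

```python
import math
import torch

# Illustrative single decode step at position t: one new query attends to
# t-1 cached keys/values plus its own.
d_k = 64
t = 100

K_cached = torch.randn(t - 1, d_k)  # keys for tokens 1..t-1, reused from cache
V_cached = torch.randn(t - 1, d_k)  # values for tokens 1..t-1, reused from cache

q_new = torch.randn(1, d_k)         # query for the new token only
k_new = torch.randn(1, d_k)
v_new = torch.randn(1, d_k)

K = torch.cat([K_cached, k_new], dim=0)        # (t, d_k)
V = torch.cat([V_cached, v_new], dim=0)        # (t, d_k)

scores = q_new @ K.T / math.sqrt(d_k)          # (1, t): one query against t keys
output = torch.softmax(scores, dim=-1) @ V     # (1, d_k)
```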
Without caching, generating token 100 means recomputing K and V for tokens 1-99 from scratch, then running attention over all 100 positions: O(100²) operations for that single token, and O(T³) total complexity for generating T tokens. Generating token 101 repeats the same work for tokens 1-100, and the redundancy compounds with every step. With KV caching, the recomputation disappears: K_{1:99} and V_{1:99} come straight out of the cache, per-token complexity at position t drops to O(t), and total generation complexity falls to O(T²). Night and day.
The fix is elegant: stash what you've already figured out.
Step 1 — Initial Token (Prefill Phase): Process the input prompt, computing Q, K, and V for all tokens. Store K and V in the cache.
Step 2 — Subsequent Tokens (Decode Phase): For each new token, compute Q, K, and V for that token only, append the new key and value to the cache (K_full = [K_cached, K_new]), and run attention with the single new query against the full cached keys and values:

```python
# Pseudocode for the decode phase at one transformer layer
K_cached, V_cached = past_key_values[layer_idx]        # reuse everything computed so far
K_full = torch.cat([K_cached, K_new], dim=1)           # append the new token's key
V_full = torch.cat([V_cached, V_new], dim=1)           # append the new token's value
attention_output = attention(Q_new, K_full, V_full)    # one query attends to all keys/values
past_key_values[layer_idx] = (K_full, V_full)          # updated cache for the next step
```

The asymmetry is key: Q is only for the new token, but it attends to all cached keys and values.
Without KV cache at sequence length 2,000: ~4,000,000 operations per token.
With KV cache at sequence length 2,000: ~2,000 operations per token.
A 2,000x reduction in operations per token compared to naive recomputation at 2,000-token sequence lengths. Brutal math.
So that's how it works. Now let's talk about what you're trading away.
KV cache trades memory for speed. The formula for memory consumption:
Memory = 2 × batch_size × num_layers × num_heads × seq_length × head_dim × bytes_per_element
The factor of 2 in the memory formula accounts for storing the Key and Value tensors separately: one full set of cached activations for K and another for V.
Concrete example for a 70B parameter model in float16: assuming a LLaMA-style configuration (80 layers, 8 KV heads under GQA, head dimension 128), a single 128K-token sequence needs roughly 40 GB for the cache alone, as the quick calculation below shows.
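A back-of-the-envelope sketch of that calculation; the configuration values are assumptions in the LLaMA-2-70B ballpark, so substitute your own model's numbers:

```python
# KV cache size estimate. Configuration is an assumption (LLaMA-2-70B-style with GQA:
# 80 layers, 8 KV heads, head_dim 128, float16); swap in your model's real values.
def kv_cache_bytes(batch_size, num_layers, num_kv_heads, seq_len, head_dim, bytes_per_element=2):
    # Factor of 2: one tensor for K and one for V
    return 2 * batch_size * num_layers * num_kv_heads * seq_len * head_dim * bytes_per_element

size = kv_cache_bytes(batch_size=1, num_layers=80, num_kv_heads=8,
                      seq_len=128_000, head_dim=128)
print(f"{size / 1e9:.1f} GB")  # ~41.9 GB, in line with the ~40 GB figure above
```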
Sequence length is the killer. Memory grows relentlessly with sequence length—doubling it doubles memory requirements. Going from 512 to 4,096 tokens (an 8x increase) requires 8x more memory for KV cache storage.
The hardware implication: a cache this size competes directly with model weights and activations for HBM, and at long contexts the cache alone can consume a large fraction of an 80 GB accelerator before a single weight is loaded.
Each transformer layer maintains its own cache:
```python
past_key_values = [
    (keys_layer_0, values_layer_0),
    (keys_layer_1, values_layer_1),
    # ... for each transformer layer
]
```

Each layer's cache has shape [batch_size, num_heads, sequence_length, head_dim].
This architecture means cache memory scales linearly with model depth. A 32-layer model needs 32× the per-layer cache of a single layer.
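You can inspect these shapes directly on GPT-2 (12 layers, 12 heads, head dimension 64). Depending on your transformers version, past_key_values is either a DynamicCache object or a tuple of (key, value) pairs; indexing by layer works either way in recent releases:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer("KV cache demo", return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, use_cache=True)

keys, values = out.past_key_values[0]   # cache for layer 0
print(len(out.past_key_values))         # 12 layers for GPT-2
print(keys.shape)                       # [batch_size, num_heads, seq_len, head_dim]
```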
Here's where things get interesting for production systems.
Traditional approaches pre-allocate contiguous memory blocks for worst-case sequence lengths, wasting 60-80% of GPU memory. Modern systems like vLLM use PagedAttention to allocate memory in fixed-size blocks, reducing memory waste to under 4%.
Continuous batching takes this further—new requests join ongoing batches dynamically, rather than waiting for batch boundaries. Combined with KV cache optimization, this achieves up to 23x throughput improvements compared to static batching approaches. That's real money saved.
Enough theory. Let's write some code.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

prompt = "The future of AI is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# WITH cache (fast)
outputs = model.generate(
    input_ids,
    max_new_tokens=50,
    use_cache=True,  # Enable caching
    return_dict_in_generate=True
)
generated_text = tokenizer.decode(outputs.sequences[0])
```

The use_cache=True parameter enables caching during model generation. When you run this code, the model processes your prompt in a single forward pass (the prefill phase), storing all computed K and V tensors. For each subsequent token, only the new token's projections get computed—the rest comes straight from cache. This is why generation feels snappy even for longer outputs.
Setting the parameter explicitly is still worth doing even though recent transformers versions enable caching by default during generation: it keeps the caching behavior visible in your code and makes the configuration intentional when you later swap in static, quantized, or offloaded caches.
For finer control:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

cache = DynamicCache()
prompt = "The quick brown fox"
inputs = tokenizer(prompt, return_tensors="pt")
generated_ids = inputs.input_ids

for _ in range(20):
    outputs = model(
        # First pass: feed the whole prompt. After that: only the newest token.
        input_ids=generated_ids if cache.get_seq_length() == 0 else generated_ids[:, -1:],
        past_key_values=cache,
        use_cache=True,
        return_dict=True
    )
    cache = outputs.past_key_values
    next_token = torch.argmax(outputs.logits[:, -1, :], dim=-1, keepdim=True)
    generated_ids = torch.cat([generated_ids, next_token], dim=-1)
    if next_token.item() == tokenizer.eos_token_id:
        break

print(f"Generated: {tokenizer.decode(generated_ids[0])}")
print(f"Cache sequence length: {cache.get_seq_length()}")
```

Notice the key pattern here: after the first forward pass, you only pass the last token (generated_ids[:, -1:]), not the full sequence. This is the entire point of KV caching. The conditional if cache.get_seq_length() == 0 handles the initial pass where there's no cache yet—you need the full input. Every subsequent iteration passes only the single new token, and the model combines it with the cached context. Watch the cache.get_seq_length() grow by one each iteration—that's your cache accumulating the conversation history.
```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer("Once upon a time", return_tensors="pt").to("cuda")

# WITH cache
start = time.time()
with torch.no_grad():
    outputs_cached = model.generate(**inputs, max_new_tokens=100, use_cache=True)
time_cached = time.time() - start

# WITHOUT cache
start = time.time()
with torch.no_grad():
    outputs_no_cache = model.generate(**inputs, max_new_tokens=100, use_cache=False)
time_no_cache = time.time() - start

print(f"With cache: {time_cached:.2f}s")
print(f"Without cache: {time_no_cache:.2f}s")
print(f"Speedup: {time_no_cache/time_cached:.1f}x")
```

On a typical GPU, expect 3-5x speedups for this 100-token generation. The gap widens dramatically with longer sequences—at 500 tokens you might see 8-10x, and at 2,000 tokens the difference becomes almost absurd. The reason is the O(T²) vs O(T³) complexity difference: without caching, each additional token makes all previous tokens more expensive to process. With caching, each token costs roughly the same regardless of position.
Production systems achieve 2-24x throughput improvements depending on implementation (2-4x with PagedAttention, up to 15x with LMCache), and specialized deployments see 9.9x latency reductions for time-to-first-token on 100K+ token contexts.
Hugging Face provides several cache implementations:
| Cache Type | Use Case | JIT Compatible |
|---|---|---|
| DynamicCache | Default, flexible generation | No |
| StaticCache | Production with torch.compile | Yes |
| QuantizedCache | Memory-constrained environments | No |
| OffloadedCache | Very long sequences | No |
For production with torch.compile:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer("The future of AI is", return_tensors="pt")

compiled_model = torch.compile(model, mode="reduce-overhead")
outputs = compiled_model.generate(
    **inputs,
    max_new_tokens=30,
    cache_implementation="static",  # fixed-size cache that torch.compile can optimize
    do_sample=False
)
```

When generating for several independent prompts, let each generate() call manage its own cache:

```python
prompts = ["First prompt", "Second prompt"]  # example inputs

# CORRECT: KV cache accumulates during a single generation, enabling efficient inference.
# Each generate() call creates and discards its own cache, so prompts don't leak state.
for prompt in prompts:
    outputs = model.generate(
        tokenizer(prompt, return_tensors="pt").input_ids,
        use_cache=True,  # Enable KV caching for efficient token generation
        max_new_tokens=50
    )

# CORRECT: Explicitly manage the cache lifecycle when you override defaults
for prompt in prompts:
    model.generation_config.cache_implementation = None  # Reset to the default cache
    outputs = model.generate(tokenizer(prompt, return_tensors="pt").input_ids)
```

Quantization:
INT8 and FP8 quantization reduce the precision of cached K and V tensors from 16-bit floating point to 8-bit integers or floats. This cuts memory consumption in half with minimal impact on output quality—typically less than 1% degradation on standard benchmarks. The tradeoff is straightforward: you're betting that the attention mechanism doesn't need full precision to make good decisions about which tokens matter. For most production workloads, this bet pays off handsomely. FP8 is generally preferred when your hardware supports it natively (H100 and newer), as it maintains better numerical properties than INT8.
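As a hedged sketch of what this looks like in Hugging Face transformers: recent releases expose a quantized cache via cache_implementation="quantized" backed by optimum-quanto, which uses 2- or 4-bit settings rather than INT8/FP8, so treat the backend and nbits values here as assumptions to adapt to your stack:

```python
# Hedged sketch: quantize cached K/V at generation time.
# Assumes a recent transformers release with the optimum-quanto backend installed.
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    cache_implementation="quantized",
    cache_config={"backend": "quanto", "nbits": 4},  # low-bit cached K/V
)
```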
CPU Offloading:
When GPU memory becomes the bottleneck, offloading KV cache to CPU memory keeps generation possible at the cost of latency. Each token generation requires transferring cached tensors from CPU to GPU, computing attention, and potentially writing updated cache back. On systems with NVLink or CXL interconnects (like NVIDIA Grace-Hopper), this penalty is manageable—900 GB/s bandwidth means a few milliseconds per token. On standard PCIe connections, expect 10-50ms overhead per token depending on cache size. Use this when you need to serve a model that simply won't fit otherwise, but understand you're trading significant latency for the privilege.
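In transformers this is roughly a one-line switch (recent versions; the actual per-token penalty depends on your interconnect):

```python
# Hedged sketch: offload cached K/V to CPU, keeping on the GPU only what the
# current layer needs. Useful when the cache won't fit in GPU memory.
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    cache_implementation="offloaded",
)
```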
PagedAttention (vLLM):
PagedAttention treats KV cache memory like a virtual memory system—instead of pre-allocating one contiguous block per sequence, it allocates fixed-size pages on demand. Think of it like the difference between reserving an entire hotel floor versus booking individual rooms as needed. Traditional allocation wastes 60-80% of memory on empty space reserved for sequences that might grow longer. PagedAttention reduces this waste to under 4%. The implementation maintains a page table mapping logical token positions to physical memory locations, enabling sequences to grow without expensive memory copies. This is why vLLM achieves such dramatic throughput improvements.
For high-throughput serving, memory allocation strategy matters enormously. Allocate up to 90% of free GPU memory for KV cache—leaving headroom causes underutilization, but going too aggressive triggers OOM errors during traffic spikes. The remaining 10% handles activation memory and framework overhead.
Enable continuous batching to maximize GPU utilization. Without it, a batch of 8 requests where one generates 500 tokens and seven generate 50 tokens will hold all eight slots until the longest finishes. With continuous batching, the seven short requests complete and get replaced while the long one continues, keeping throughput high.
Prefix caching deserves special attention for applications with repeated context. If every request starts with the same 2,000-token system prompt, computing K and V for those tokens once and reusing them across requests saves massive computation. Production systems report 10x cost reduction on cached prefixes.
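Here is a sketch of how these serving-side pieces fit together in vLLM, where PagedAttention and continuous batching are built in; the model name and parameter values are illustrative:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model choice
    gpu_memory_utilization=0.9,                # budget ~90% of GPU memory for weights + KV cache
    enable_prefix_caching=True,                # reuse K/V across requests sharing a prefix
)

system_prompt = "You are a support assistant for ExampleCorp."  # shared prefix, cached after the first request
params = SamplingParams(max_tokens=128)
outputs = llm.generate([system_prompt + "\n\nHow do I reset my password?"], params)
print(outputs[0].outputs[0].text)
```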
Monitor cache hit rates continuously. High eviction frequency—where the system constantly throws away cached data to make room for new requests—indicates you've undersized your cache allocation relative to your traffic pattern. Either add memory, reduce batch sizes, or implement smarter eviction policies that preserve high-value cached content.
OpenAI plays the caching game aggressively with GPT-4. Cache hits deliver up to 80% latency reduction, and cached content costs 90% less in input token pricing. The cache retention policy is dynamic—frequently-accessed prefixes stick around for hours, while one-off content might evaporate in minutes. This creates interesting optimization opportunities: if your application sends the same system prompt repeatedly, you're essentially getting it for free after the first request. Batch similar requests together in time, and you'll see your costs drop substantially.
Google implements a three-tier storage architecture that reflects the memory hierarchy realities of large-scale inference:
Hot cache data lives in HBM for instant access. Warm data drops to CPU memory, adding milliseconds of latency but keeping content available. Cold data can persist to SSD for very long-running conversations. Results from Google DeepMind's tiered KV cache system demonstrate: 79% reduction in Time-to-First-Token, 264% increase in input throughput for 100K+ token contexts. The tiered approach lets them serve million-token contexts that would be impossible with GPU-only caching.
LLaMA uses Grouped Query Attention (GQA) architecturally—a training-time decision that fundamentally changes how KV cache scales. Standard multi-head attention gives each query head its own key and value head, meaning cache size scales with total head count. GQA shares key-value heads across groups of query heads, reducing cache to 12.5-25% of standard size depending on group configuration. The tradeoff happens during training: you're betting that shared key-value representations capture enough information for multiple query perspectives. Meta's extensive ablations showed this bet pays off—near-identical quality with dramatically better inference characteristics. Combined with quantization, this architectural choice delivers a cache that is a small fraction of what standard multi-head attention would require, as the arithmetic below illustrates.
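The head counts below are assumptions in the LLaMA-70B ballpark (64 query heads, 8 KV heads), but the ratio is what matters:

```python
# GQA vs. standard multi-head attention: cache size scales with the number of K/V heads.
num_query_heads = 64
num_kv_heads_mha = 64   # standard MHA: one K/V head per query head
num_kv_heads_gqa = 8    # GQA: groups of 8 query heads share one K/V head

ratio = num_kv_heads_gqa / num_kv_heads_mha
print(f"GQA cache is {ratio:.1%} of the MHA cache")  # 12.5%
```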
LMCache achieves up to 15x throughput improvement and up to 2x faster token generation latency through cross-query cache sharing and prefill-decode disaggregation, as documented in the LMCache technical report.
KVLink optimizes inference for document-heavy workloads by precomputing KV tensors for documents and concatenating them with positional embedding adjustments, reducing the need to recompute key-value cache for repeated document contexts.
The lesson from these production systems is consistent: the biggest wins come from avoiding redundant computation across requests, not just within a single generation. If two users ask questions about the same document, computing that document's KV cache once and sharing it delivers order-of-magnitude improvements.
KV cache isn't an alternative to other optimization techniques—it's the foundation that makes transformer inference practical at all. Understanding how other techniques interact with KV cache helps you build an effective optimization stack.
| Technique | What It Does | Memory Impact | Performance Improvement | Stacks With KV Cache? |
|---|---|---|---|---|
| KV Cache | Stores attention K/V tensors | Scales linearly with sequence length | 2-10x throughput (vs non-cached) | N/A (foundational) |
| Quantization | Reduces K/V precision to INT8/FP8 | 50-75% cache memory reduction | 1.5-3x throughput with native support | Yes—quantize cached tensors |
| Distillation | Smaller model with retraining | 40-60% model size reduction | 2-4x latency reduction | Yes—smaller model = smaller cache |
| Pruning | Removes attention heads/weights | Variable reduction per method | Variable (hardware-dependent) | Yes—fewer active heads = smaller cache |
| Speculative Decoding | Draft model generates tokens in parallel | Minimal overhead (shared cache) | 2-3x latency reduction (no accuracy loss) | Yes—both models share KV cache |
| MQA/GQA | Single or grouped K/V heads | 3-25% of standard cache size | 1.5-2x inference speed | Yes—architectural choice reduces cache |
The key insight: KV cache provides the baseline that everything else builds on. Without it, you're starting from such a bad position that other optimizations barely matter. With it, you can stack additional techniques to squeeze out another 2-10x depending on your constraints.
When KV cache alone isn't enough, reach for the rest of the stack: quantize the cached tensors, bound the cache with sliding window attention, offload cold entries to CPU, or adopt GQA/MQA at training time. The stack for maximum performance combines KV cache with PagedAttention, continuous batching, and cache quantization, with GQA baked into the architecture when you control training.
All this sounds great, but there's no free lunch.
For a 70B parameter model at 128K context length, the cache alone is approximately 40 GB. Stacked on top of roughly 140 GB of float16 weights, this regularly exceeds the HBM of a single accelerator.
When the cache outgrows memory, the failure modes are predictable: hard OOM errors, aggressive eviction that tanks cache hit rates, and latency spikes as data spills to slower tiers. Three architectural choices bound cache growth:
Multi-Query Attention (MQA): every query head shares a single key-value head, shrinking the cache to a few percent of its standard size, at some quality cost.
Grouped Query Attention (GQA): query heads share key-value heads in groups, cutting the cache to 12.5-25% of standard size with near-identical quality; this is the LLaMA approach.
Sliding Window Attention: the model attends only to the most recent window of tokens, so the cache is bounded by the window size regardless of sequence length; Mistral uses this.
Not clearing cache between sessions creates insidious bugs. The cache accumulates state from previous generations, causing the model to condition on irrelevant context. Symptoms include nonsensical outputs, repeated phrases from earlier prompts, or mysterious OOM errors that only appear after running for a while.
Assuming linear cache growth trips up developers working with modern architectures. Models using sliding window attention (Mistral, some LLaMA variants) plateau or even shrink their cache after reaching the window size. If you're budgeting memory based on sequence length × constant, you'll over-allocate for these models.
Using cache during training causes dimension mismatches and incorrect gradients. Training uses teacher forcing where the model sees the entire target sequence simultaneously—there's no sequential generation that would benefit from caching. The use_cache parameter should always be False during training.
Ignoring hardware limits leads to production outages. Memory requirements compound: model weights + KV cache + activation memory + framework overhead must all fit. A model that runs fine on your development machine might OOM in production with different batch sizes or longer contexts.
Why doesn't the model recompute Q as well as K and V?
Q (Query) represents "what the current token is looking for." It only matters for the token being generated right now. Previous tokens' queries are irrelevant for the current attention computation—we only need their keys and values.
Can I use KV cache during training?
No. Training uses teacher forcing where the model sees the entire target sequence simultaneously—all positions are processed in parallel, not sequentially. There's no "previous token" in the sense that inference has, because the model attends to the full sequence at once. Caching would add memory overhead without any computational benefit, and the cache update logic would interfere with proper gradient computation.
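For reference, a minimal sketch of a training step where the cache stays off; the batch variable and its keys are assumptions about your data pipeline:

```python
# Teacher-forced training forward pass: the whole target sequence is processed at once,
# so there is nothing to cache. Keep use_cache off explicitly.
outputs = model(
    input_ids=batch["input_ids"],
    attention_mask=batch["attention_mask"],
    labels=batch["labels"],
    use_cache=False,
)
loss = outputs.loss
loss.backward()
```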
How do I know if KV cache is actually being used?
Check the cache object after generation:
```python
outputs = model.generate(..., return_dict_in_generate=True)
if outputs.past_key_values:
    print(f"Cache has {len(outputs.past_key_values)} layers")
```

What happens when the cache gets too large?
Options: quantize the cache, offload to CPU, use PagedAttention for dynamic allocation, or implement sliding window attention to bound cache size.
Does batch size affect cache memory linearly?
Yes. Batch size 8 requires 8× the cache memory of batch size 1 because KV cache memory scales linearly with batch size according to the formula: Memory_KV = 2 × batch_size × num_layers × num_heads × seq_len × head_dim × bytes_per_element. This is why production systems carefully balance batch size against available memory, using strategies like PagedAttention for efficient allocation and continuous batching to maximize GPU utilization.
Can different requests share KV cache?
Yes—this is called prefix caching. According to OpenAI's official documentation, requests with identical prefixes (system prompts, few-shot examples) can reuse cached K/V tensors. Anthropic similarly implements prompt caching with KV reuse for cached content, though the two implementations differ substantially in retention behavior and pricing.
What's the difference between DynamicCache and StaticCache?
DynamicCache provides flexible memory management that grows dynamically with sequence length, while StaticCache pre-allocates fixed-size memory blocks and enables JIT compilation for optimized inference performance. The choice between them involves trade-offs: DynamicCache offers flexibility for variable-length sequences but cannot be compiled with torch.compile, whereas StaticCache requires knowing the maximum sequence length in advance but unlocks compiler optimizations.
How does KV cache work with multi-GPU setups?
Tensor parallelism and pipeline parallelism are the two main strategies for distributing large models across GPUs. Tensor parallelism splits each layer's weight matrices and attention heads across GPUs, so the KV cache is sharded by head and per-token latency can improve; pipeline parallelism assigns different layers to different GPUs, so each GPU holds only its own layers' cache and the setup favors throughput. Both require careful memory planning to balance cache placement, GPU utilization, and communication overhead across the cluster.
KV cache transforms transformer inference from "theoretically possible" to "production-ready." Without it, generating long sequences would be computationally prohibitive.
There's something elegant about KV cache—it embodies a fundamental truth of computation: remembering is almost always cheaper than recomputing. The trick is knowing what to remember, how long to keep it, and when to let it go.
Key takeaways:
- Enable use_cache=True for any generation task
- Cache memory scales with batch_size × layers × heads × seq_length × head_dim

Next steps:
If you're deploying models, start with Hugging Face's DynamicCache and measure your memory consumption. For long sequences or high throughput requirements, adopt PagedAttention through vLLM to enable non-contiguous memory allocation (typically a 2-4x improvement in GPU utilization). If you're hitting memory limits, combine PagedAttention with cache quantization (FP8 for roughly a 50% reduction versus FP16, 4-bit formats such as NVFP4 for long-context scenarios), or consider Grouped Query Attention (GQA) if retraining is feasible, which shrinks the cache to 12.5-25% of its standard size. For production deployment at scale, TensorRT-LLM provides CUDA kernel optimizations alongside PagedAttention, with priority-based cache management for approximately 20% improvement in cache hit rates.
If you're building models, consider GQA architecture during training—it's a permanent decision that pays dividends in deployment.
This guide focuses specifically on KV cache optimization for transformer inference. KV cache is a foundational technique used in production LLM systems, reducing autoregressive generation complexity from O(T³) to O(T²) and enabling substantial throughput improvements when combined with other optimizations like quantization and continuous batching. Understanding KV cache moves you from "I can run a model" to "I can deploy a model efficiently."
That's a meaningful difference—the difference between a demo and a product. Now go make something fast.
