

TL;DR: KV cache speeds up LLM text generation by storing computed key and value tensors from previous tokens, enabling fast incremental generation without reprocessing the entire context. It delivers 5x faster inference compared to recomputation during autoregressive generation, but comes with linear memory growth that scales with sequence length, batch size, and model layers—quickly consuming your GPU budget.
KV cache is a computational optimization specific to transformer-based language models. It stores the key (K) and value (V) tensors produced by the attention layers, eliminating redundant recomputation of key and value projections for previously generated tokens and cutting the cost of autoregressive text generation from O(m³·d) to O(m²·d).
Here's the problem: every time your LLM generates a new token, it needs to compute attention over all previously generated tokens. Without caching, this means recalculating key and value projections for tokens that haven't changed—pure computational waste. Research shows this results in O(m³·d) operations for generating m tokens autoregressively, where the cost scales cubically with sequence length.
The math is unforgiving. Generating 1,000 tokens without caching performs on the order of 1,000³ cumulative attention operations, roughly 333 million query-key interactions, the vast majority of them redundant recomputation over tokens that haven't changed. With caching, you compute each key-value pair exactly once.
That is the fundamental O(m³·d) bottleneck in autoregressive transformer inference, where d is the model dimension: step t redoes work proportional to t² that earlier steps already performed. With KV caching, each step computes only the new query, key, and value and attends over the cached positions, so the total cost falls to O(m²·d).
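To make the asymptotics concrete, here is a small back-of-the-envelope count of query-key interactions with and without caching for m = 1,000 tokens (plain Python, no model required; the per-dimension factor d cancels out of the ratio):

```python
# Count cumulative query-key interactions when generating m tokens.
m = 1_000

# Without caching: step t recomputes attention for all t positions against
# all t positions, so each step costs roughly t^2 interactions.
without_cache = sum(t * t for t in range(1, m + 1))   # ~m^3 / 3

# With caching: step t computes one new query/key/value and attends over
# the t cached positions, so each step costs roughly t interactions.
with_cache = sum(t for t in range(1, m + 1))          # ~m^2 / 2

print(f"without cache: {without_cache:,}")                   # 333,833,500
print(f"with cache:    {with_cache:,}")                      # 500,500
print(f"ratio:         {without_cache / with_cache:.0f}x")   # ~667x
```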
How it works: During attention computation—Attention(Q, K, V) = softmax(QK^T / √d_k) × V—the model caches the key and value tensors for all processed tokens. When generating token t, it computes a new query Q_t, computes only the new K_t and V_t, appends them to the cached K_1 through K_{t-1} and V_1 through V_{t-1}, and runs attention over the full set.
Why only cache keys and values? The query represents what the current token is "looking for" in the context—it changes with every new token. Keys and values represent the fixed information content of past tokens that never changes once computed.
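A minimal single-head sketch of that decode step in PyTorch (names and shapes here are illustrative, not any particular library's API):

```python
import torch
import torch.nn.functional as F

def decode_step(x_t, W_q, W_k, W_v, k_cache, v_cache):
    """One autoregressive decode step with a KV cache (single attention head).

    x_t:      [batch, 1, d_model]   embedding of the newly generated token
    k_cache:  [batch, t-1, d_head]  cached keys for all previous tokens
    v_cache:  [batch, t-1, d_head]  cached values for all previous tokens
    """
    q_t = x_t @ W_q   # query for the new token only; never cached
    k_t = x_t @ W_k   # key for the new token only
    v_t = x_t @ W_v   # value for the new token only

    # Append the new key/value; entries for earlier tokens are never recomputed.
    k_cache = torch.cat([k_cache, k_t], dim=1)   # [batch, t, d_head]
    v_cache = torch.cat([v_cache, v_t], dim=1)   # [batch, t, d_head]

    # Attention for the single new query over all cached positions.
    scores = q_t @ k_cache.transpose(1, 2) / k_cache.shape[-1] ** 0.5
    out = F.softmax(scores, dim=-1) @ v_cache    # [batch, 1, d_head]
    return out, k_cache, v_cache
```

In a real model this happens independently in every layer and every head, which is exactly why the cache grows the way the next section describes.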
Where you'll encounter this: Every production LLM system uses KV caching. OpenAI, Anthropic, and Google all implement it behind their APIs. If you're running inference with HuggingFace Transformers, vLLM, or similar frameworks, caching is enabled by default during generation.
The implementation creates a growing cache structure: [batch_size, num_heads, sequence_length, head_dim] for both keys and values, organized per-layer. For a 32-layer model, that's 64 separate tensors growing with each generated token.
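You can see that structure directly in HuggingFace Transformers by inspecting past_key_values. A rough sketch (return types vary by library version, tuples in older releases versus a Cache object in newer ones, so treat the indexing as illustrative):

```python
# Requires: pip install torch transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # any causal LM works; gpt2 (12 layers, 12 heads) is small enough to test
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tok("KV caching trades memory for speed", return_tensors="pt")
with torch.no_grad():
    # Prefill: one parallel forward pass over the prompt populates the cache.
    out = model(**inputs, use_cache=True)

past = out.past_key_values
k0, v0 = past[0]
print(len(past), k0.shape)   # 12 layers; keys shaped [batch, num_heads, seq_len, head_dim]

with torch.no_grad():
    # Decode: feed only the new token plus the cache; seq_len grows by one.
    next_id = out.logits[:, -1:].argmax(dim=-1)
    out2 = model(input_ids=next_id, past_key_values=past, use_cache=True)
print(out2.past_key_values[0][0].shape)  # cached sequence_length is now one longer
```

The two forward passes also illustrate the prefill/decode split discussed further down: the first populates the cache for the whole prompt in parallel, the second only appends one token's worth of keys and values.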
KV caching delivers a 5x speedup in autoregressive generation, but the memory cost scales brutally with sequence length and model size.
Memory footprint formula:
Total KV Cache (bytes) = batch_size × sequence_length × 2 × n_layers × d_model × precision_in_bytes
Alternatively, expressed in gigabytes:
Total KV Cache (GB) = batch_size × sequence_length × 2 × n_layers × d_model × precision_in_bytes / 10⁹
Where:
batch_size: number of sequences processed in parallel
sequence_length: number of tokens whose keys and values are held in the cache
2: accounts for storing both a key tensor and a value tensor
n_layers: number of transformer layers
d_model: model hidden dimension (num_heads × head_dim)
precision_in_bytes: bytes per element (2 for FP16/BF16, 1 for INT8)
Concrete example: Llama-2-7B processing 28,000 tokens requires 14.7 GB just for the KV cache at FP16 precision. That's before loading the 13 GB of model weights. Your 24 GB GPU is already over budget.
For Llama-3-70B at a 1-million-token context, the KV cache alone consumes approximately 330 GB, and that is with Grouped-Query Attention (GQA) already shrinking the cache to 8 KV heads instead of the full 64 query heads; no single GPU comes close to holding it.
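The formula turns into a quick calculator. A minimal sketch that reproduces both figures (the layer counts, KV head counts, and head dimensions below are the commonly published configurations for these models and should be checked against the config you actually deploy):

```python
def kv_cache_gb(batch_size, seq_len, n_layers, n_kv_heads, head_dim,
                precision_bytes=2):
    """KV cache size in GB: batch x seq x 2 (K and V) x layers x KV width x bytes."""
    kv_width = n_kv_heads * head_dim  # equals d_model when every query head has its own K/V
    total_bytes = batch_size * seq_len * 2 * n_layers * kv_width * precision_bytes
    return total_bytes / 1e9

# Llama-2-7B: 32 layers, 32 KV heads x 128 dims (d_model = 4096), FP16
print(kv_cache_gb(1, 28_000, 32, 32, 128))       # ~14.7 GB

# Llama-3-70B: 80 layers, GQA with 8 KV heads x 128 dims, FP16
print(kv_cache_gb(1, 1_000_000, 80, 8, 128))     # ~327.7 GB
```

Swapping precision_bytes to 1 (INT8) or 0.5 (INT4) is a quick way to see why the quantization techniques listed later are attractive.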
API cost implications are massive. Anthropic offers a 90% discount on cached tokens after a 25% write premium on the first request. Under that structure, a 50,000-token knowledge base reused across 10 requests costs roughly $64.50 instead of $300: one cache write at 1.25× plus nine cached reads at 0.1× of the base input rate, for about $235 in savings. OpenAI provides a 50% discount with automatic caching for prompts over 1,024 tokens.
But rate limits tell the real story. Anthropic excludes cached tokens from input token per minute (ITPM) limits, effectively increasing your throughput capacity. OpenAI still counts cached tokens against token per minute limits—you save money but don't gain capacity.
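As a back-of-the-envelope model of that pricing structure (the 1.25× write premium and 0.1× read rate mirror the numbers above; output tokens and per-request unique content are ignored, so treat this as a sketch rather than a billing calculator):

```python
def cached_cost_fraction(num_requests, write_premium=1.25, read_discount=0.10):
    """Fraction of the uncached input-token cost paid when a shared prefix is
    written to the cache once and read on every subsequent request."""
    effective_prefix_units = write_premium + (num_requests - 1) * read_discount
    return effective_prefix_units / num_requests

frac = cached_cost_fraction(10)
print(f"cached cost = {frac:.1%} of uncached")             # 21.5%
print(f"on a $300 uncached baseline: ${300 * frac:.2f}")   # $64.50
```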
When KV cache matters most: Multi-turn conversations, document Q&A with large contexts, code generation with substantial boilerplate, and any scenario where you're reprocessing the same prefix repeatedly.
When it matters least: Single-token generation, very short sequences (under 10 tokens), or batch processing where memory constraints force you into smaller batches that hurt overall throughput.
The inflection point is around 100-500 tokens, depending on your hardware. Below that, cache management overhead often exceeds computational savings.
"KV cache affects output quality."
This is wrong. KV cache stores exact intermediate tensors without approximation or modification. Given identical seeds and hardware, outputs remain deterministic with or without caching. Any quality differences indicate implementation bugs, not the caching mechanism itself.
"Everything can be cached during inference."
No. Only self-attention key and value computations benefit from caching. You still recompute tokenization, embeddings, positional encodings, feed-forward layers, layer normalizations, and output projections at every step. Query vectors change with each new token and can't be cached.
"KV cache is useful during training."
Completely backwards. Training uses teacher forcing to compute all key-value pairs for the full sequence in parallel during a single forward pass. HuggingFace's documentation states: "Caching is only applicable during inference, not training" because training's parallelized nature makes sequential caching both unnecessary and inapplicable. There's nothing to cache from previous generation steps because you're processing the entire sequence simultaneously.
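A quick way to convince yourself: a training-style forward pass scores every position of the sequence in one shot, so there is no earlier decoding step whose keys and values could be reused. A small sketch, using gpt2 purely as a stand-in model:

```python
# Teacher forcing: one parallel forward pass over the whole sequence computes
# the loss at every position under a causal mask. There is no incremental
# state from a "previous step" to cache.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

batch = tok("the quick brown fox jumps over the lazy dog", return_tensors="pt")
out = model(**batch, labels=batch["input_ids"], use_cache=False)
print(out.loss)  # next-token loss for all positions, from a single pass
```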
"Cache management in conversations is automatic."
It's more complex than you think. The cache grows linearly with conversation length and must be managed carefully. Anthropic invalidates caches after 5 minutes of inactivity or when system messages are inserted mid-conversation. For long conversations, you'll hit memory limits and need strategies like CPU offloading or cache compression.
Most misconceptions stem from conflating KV cache with general caching systems. This isn't a lookup table—it's a computational optimization within the attention algorithm that trades memory for speed in a very specific mathematical context.
Context window: The maximum sequence length your model can process. Larger context windows require more KV cache memory, which scales linearly with sequence length, batch size, number of layers, and model dimensions according to the formula: 2 × batch_size × sequence_length × n_layers × d_model × precision_in_bytes.
Attention mechanism: The core transformer operation that KV cache optimizes. Understanding attention math helps you grasp why only keys and values can be cached.
Prefill vs decode: Two distinct inference phases with different KV cache behaviors. Prefill is a compute-bound phase that populates the cache in parallel for all input tokens simultaneously; decode is a memory-bound phase that extends the cache sequentially, adding KV pairs only for newly generated tokens.
Batch processing: KV cache requirements scale linearly with batch size, creating memory-throughput tradeoffs in production systems.
Quantization: Reduces KV cache memory by storing tensors in lower precision formats (INT8, INT4, or even 1-bit with coupled quantization), trading some accuracy for capacity. Research shows quantization can achieve 6.4x memory reduction under 2-bit quantization with minimal accuracy loss, and can be effectively combined with other optimization techniques like low-rank projection for compounded benefits.
Multi-query attention: Architectural approach that shares key-value projections across attention heads, reducing KV cache to 1/H of the original size (where H = number of heads). Requires model retraining but trades reduced attention expressiveness for significant memory savings.
State-space models: Emerging architectures like Mamba that achieve constant memory complexity by replacing attention with selective state-space updates, eliminating KV cache growth entirely.
Understanding KV cache isn't just about optimization: it's about making informed architectural decisions. When you're choosing between hosting your own models and using APIs, designing multi-turn conversation systems, or planning GPU infrastructure, KV cache memory requirements often become the binding constraint that shapes everything else.

Sergey Kaplich