November 20, 2025

What is KV Cache? A Complete Guide to Faster LLM Inference

Stevie Kim, Product and Engineering at GrowthX


TL;DR

KV cache is a critical optimization technique that stores previously computed Key and Value tensors during transformer inference, eliminating redundant calculations and reducing generation complexity from O(T³) to O(T²). This translates to 2-24x throughput improvements in production systems. Without KV cache, each new token would require recomputing attention for all previous tokens; with it, only the new token's projections need computation while cached values are reused. Memory consumption scales linearly with sequence length (a 70B model at 128K context needs ~40GB for cache alone), so production deployments combine KV cache with techniques like PagedAttention, quantization, and continuous batching to maximize efficiency.

Watch a 70B model generate text without optimization, and you'll see it sweat. Each new token triggers a cascade—the model recomputes attention for every previous token, again and again and again. Token 100? Recompute 1-99. Token 101? Recompute 1-100. The redundancy is brutal.

Without optimization, the self-attention mechanism requires recomputing key (K) and value (V) vectors for all previous tokens at each generation step, resulting in O(T³) total complexity for generating T tokens. KV cache kills this redundancy dead by stashing the previously computed Key and Value tensors, reducing per-token complexity from O(t²) to O(t) at position t. The payoff? Production systems report throughput improvements of 5-24x with continuous batching and PagedAttention, while specialized configurations achieve up to 9.9x reductions in Time-to-First-Token latency.

If you're building anything that generates text with transformers, you need to understand this.

What is KV Cache?

KV cache is an optimization technique that stores the Key and Value projection vectors computed during transformer self-attention, eliminating redundant calculations during autoregressive text generation.

The specific problem it solves: every time a transformer generates a new token, the attention mechanism needs access to the key and value vectors for all previous tokens. Without caching, the model recomputes these vectors from scratch at every step. Token 100 requires computing K and V for tokens 1-99. Token 101 requires computing K and V for tokens 1-100. The redundancy is staggering.

KV cache sits at the heart of transformer inference optimization. While techniques like quantization and pruning compress the model itself, KV cache optimizes the generation process. It's not optional for production deployment—it's the baseline that makes everything else possible.

Why KV Cache Exists & How It Works

The Attention Mechanism in 60 Seconds

Transformer attention operates through a precise mathematical pipeline. Input embeddings get projected into three matrices: Query (Q), Key (K), and Value (V). The core formula:

Attention(Q,K,V) = softmax(Q·K^T / √d_k) · V

Think of it this way:

  • Query (Q): "What am I looking for?"
  • Key (K): "What do I offer for matching?"
  • Value (V): "What information do I actually contain?"

During generation, the Query vector for each new token attends to all previously cached Key vectors to determine relevance, then retrieves information from the corresponding cached Value vectors. This asymmetry—where a single new Query attends to a growing set of cached Keys and Values—enables the efficiency gains of KV caching.
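To make the asymmetry concrete, here is a minimal PyTorch sketch of a single decode step, using random tensors rather than any particular model: one new Query attends to a cache of 99 earlier Keys and Values, and only the new token's K and V get appended.

import torch
import torch.nn.functional as F

batch, n_heads, head_dim = 1, 8, 64
cached_len = 99  # tokens 1-99 already processed

# Stand-ins for the cached projections from earlier steps
K_cached = torch.randn(batch, n_heads, cached_len, head_dim)
V_cached = torch.randn(batch, n_heads, cached_len, head_dim)

# Projections for the new token only (position 100)
Q_new = torch.randn(batch, n_heads, 1, head_dim)
K_new = torch.randn(batch, n_heads, 1, head_dim)
V_new = torch.randn(batch, n_heads, 1, head_dim)

# Append the new K/V to the cache, then attend with the single new Query
K = torch.cat([K_cached, K_new], dim=2)
V = torch.cat([V_cached, V_new], dim=2)
scores = Q_new @ K.transpose(-2, -1) / head_dim ** 0.5  # shape [1, 8, 1, 100]
weights = F.softmax(scores, dim=-1)
output = weights @ V                                     # shape [1, 8, 1, 64]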

The Recomputation Problem

Here's where things get expensive. At each generation step, the attention formula requires:

Attention(Q_t, K_{1:t}, V_{1:t})

Where K_{1:t} means "all keys from position 1 to position t."

Without caching, generating token 100 means:

  1. Recompute K and V for tokens 1-99 from scratch (from hidden states through projection matrices W_k and W_v)
  2. Compute K and V for token 100
  3. Run attention using Q_100 with full K_{1:100} and V_{1:100}

This requires O(100²) operations per token at position 100, scaling to O(T³) total complexity for generating T tokens. With KV caching, step 1 is eliminated—K_{1:99} and V_{1:99} are retrieved from cache instead, reducing per-token complexity to O(100) and total generation complexity to O(T²). Night and day.

With the cache in place, generating token 101 means:

  1. Compute K and V for token 101 only
  2. Retrieve cached K and V for tokens 1-100
  3. Concatenate cached K and V with newly computed token 101's K and V
  4. Run attention using Q from token 101 with the full concatenated K and V

The redundancy disappears. Per-token complexity becomes O(t), and total generation complexity becomes O(T²) for T tokens.

How KV Cache Eliminates Redundancy

The fix is elegant: stash what you've already figured out.

Step 1 — Initial Token (Prefill Phase): Process the input prompt, computing Q, K, and V for all tokens. Store K and V in the cache.

Step 2 — Subsequent Tokens (Decode Phase): For each new token:

  1. Compute Q, K, and V for only the new token
  2. Retrieve cached K and V from previous steps
  3. Concatenate: K_full = [K_cached, K_new]
  4. Run attention with the full context
  5. Update cache with new K and V
# Pseudocode for the decode phase (per layer)
K_cached, V_cached = past_key_values[layer_idx]

# Concatenate along the sequence dimension (dim=2 for [batch, num_heads, seq_len, head_dim])
K_full = torch.cat([K_cached, K_new], dim=2)
V_full = torch.cat([V_cached, V_new], dim=2)

attention_output = attention(Q_new, K_full, V_full)
past_key_values[layer_idx] = (K_full, V_full)

The asymmetry is key: Q is only for the new token, but it attends to all cached keys and values.

The Numbers

Without KV cache at sequence length 2,000: ~4,000,000 operations per token.

With KV cache at sequence length 2,000: ~2,000 operations per token.

A 2,000x reduction in operations per token compared to naive recomputation at 2,000-token sequence lengths. Brutal math.
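These counts are easy to reproduce with plain arithmetic. The sketch below assumes the simplified cost model used throughout this article, where per-token work is proportional to the number of key/value positions that must be computed or touched:

T = 2_000  # sequence length

# Without cache: at position t we recompute K/V for all t tokens, ~t^2 work
ops_no_cache_last_token = T ** 2                        # ~4,000,000
total_no_cache = sum(t ** 2 for t in range(1, T + 1))   # grows as O(T^3)

# With cache: at position t we project one token and attend over t cached keys
ops_cached_last_token = T                               # ~2,000
total_cached = sum(t for t in range(1, T + 1))          # grows as O(T^2)

print(ops_no_cache_last_token // ops_cached_last_token)  # 2000x fewer ops for the last token
print(total_no_cache // total_cached)                     # ~1333x less work over the whole sequence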

Key Features & Capabilities

So that's how it works. Now let's talk about what you're trading away.

Memory-Speed Trade-off

KV cache trades memory for speed. The formula for memory consumption:

Memory = 2 × batch_size × num_layers × num_heads × seq_length × head_dim × bytes_per_element

The factor of 2 accounts for storing the Key and Value tensors separately: each contributes batch_size × num_layers × num_heads × seq_length × head_dim elements to the total. For models that use grouped-query attention, num_heads here means the number of key/value heads, not query heads.

Concrete example for a 70B parameter model in float16:

  • 80 layers, 64 query heads sharing 8 key/value heads (grouped-query attention), head dimension 128
  • At 128K tokens: ~40 GB just for KV cache
  • This often exceeds the model weights themselves

Sequence length is the killer. Memory grows relentlessly with sequence length—doubling it doubles memory requirements. Going from 512 to 4,096 tokens (an 8x increase) requires 8x more memory for KV cache storage.
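Plugging the formula into a few lines of Python makes these numbers easy to sanity-check. The configuration below assumes a LLaMA-style 70B model with grouped-query attention (8 key/value heads), which is what makes the ~40 GB figure work out; kv_cache_bytes is an illustrative helper, not a library function:

def kv_cache_bytes(batch_size, num_layers, num_kv_heads, seq_len, head_dim, bytes_per_element=2):
    # Factor of 2 = separate Key and Value tensors; bytes_per_element=2 for float16
    return 2 * batch_size * num_layers * num_kv_heads * seq_len * head_dim * bytes_per_element

# LLaMA-style 70B: 80 layers, 8 KV heads (GQA), head dimension 128, 128K tokens
size = kv_cache_bytes(batch_size=1, num_layers=80, num_kv_heads=8,
                      seq_len=128_000, head_dim=128)
print(f"{size / 1e9:.1f} GB")  # ~41.9 GB

# Linear scaling: doubling the sequence length doubles the cache
print(f"{kv_cache_bytes(1, 80, 8, 256_000, 128) / 1e9:.1f} GB")  # ~83.9 GB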

Hardware implications:

  • A100 80GB can handle ~4K context at batch size 1 for 70B models (with quantization)
  • For 70B models at 128k tokens, KV cache alone can consume approximately 40GB of memory in float16 precision
  • Very long contexts (128K+) may require 4-8 GPUs or additional CPU/storage memory just for cache storage
  • Memory planning must account for: model weights + KV cache + activation memory

Multi-Layer Caching

Each transformer layer maintains its own cache:

past_key_values = [
    (keys_layer_0, values_layer_0),
    (keys_layer_1, values_layer_1),
    # ... one (keys, values) pair for each transformer layer
]

Each layer's cache has shape [batch_size, num_heads, sequence_length, head_dim].

This architecture means cache memory scales linearly with model depth. A 32-layer model needs 32× the per-layer cache of a single layer.
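A quick way to see this structure is to run one forward pass with caching enabled and inspect the per-layer tensors. The snippet below uses GPT-2 small (12 layers, 12 heads, head dimension 64); depending on your transformers version the cache is either a tuple of (key, value) pairs or a Cache object, but both support the indexing shown here:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer("The quick brown fox", return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, use_cache=True)

cache = out.past_key_values
print(len(cache))                  # 12 entries, one per transformer layer
keys_layer_0, values_layer_0 = cache[0]
print(keys_layer_0.shape)          # [batch_size, num_heads, sequence_length, head_dim]
                                   # e.g. torch.Size([1, 12, 4, 64]) for this 4-token prompt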

Batch Processing with KV Cache

Here's where things get interesting for production systems.

Traditional approaches pre-allocate contiguous memory blocks for worst-case sequence lengths, wasting 60-80% of GPU memory. Modern systems like vLLM use PagedAttention to allocate memory in fixed-size blocks instead:

  • Memory waste drops to under 4%
  • Non-contiguous storage enables flexible allocation
  • LRU eviction policies manage memory pressure

Continuous batching takes this further—new requests join ongoing batches dynamically, rather than waiting for batch boundaries. Combined with KV cache optimization, this achieves up to 23x throughput improvements compared to static batching approaches. That's real money saved.

Getting Started

Enough theory. Let's write some code.

Basic Generation with Hugging Face

from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

prompt = "The future of AI is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# WITH cache (fast)
outputs = model.generate(
    input_ids,
    max_new_tokens=50,
    use_cache=True,  # Enable caching
    return_dict_in_generate=True
)
generated_text = tokenizer.decode(outputs.sequences[0])

The use_cache=True parameter enables caching during model generation. When you run this code, the model processes your prompt in a single forward pass (the prefill phase), storing all computed K and V tensors. For each subsequent token, only the new token's projections get computed—the rest comes straight from cache. This is why generation feels snappy even for longer outputs.

Recent transformers versions enable use_cache by default during generation, but setting it explicitly documents your intent and protects you against configurations where caching has been turned off.

Manual Cache Management

For finer control:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, DynamicCache

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

cache = DynamicCache()
prompt = "The quick brown fox"
inputs = tokenizer(prompt, return_tensors="pt")
generated_ids = inputs.input_ids

for _ in range(20):
    outputs = model(
        # Full prompt on the first pass, only the newest token afterwards
        input_ids=generated_ids if cache.get_seq_length() == 0 else generated_ids[:, -1:],
        past_key_values=cache,
        use_cache=True,
        return_dict=True
    )
    cache = outputs.past_key_values
    next_token = torch.argmax(outputs.logits[:, -1, :], dim=-1, keepdim=True)
    generated_ids = torch.cat([generated_ids, next_token], dim=-1)
    if next_token.item() == tokenizer.eos_token_id:
        break

print(f"Generated: {tokenizer.decode(generated_ids[0])}")
print(f"Cache sequence length: {cache.get_seq_length()}")

Notice the key pattern here: after the first forward pass, you only pass the last token (generated_ids[:, -1:]), not the full sequence. This is the entire point of KV caching. The conditional if cache.get_seq_length() == 0 handles the initial pass where there's no cache yet—you need the full input. Every subsequent iteration passes only the single new token, and the model combines it with the cached context. Watch the cache.get_seq_length() grow by one each iteration—that's your cache accumulating the conversation history.

Benchmarking the Difference

import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer("Once upon a time", return_tensors="pt").to("cuda")

# WITH cache
start = time.time()
with torch.no_grad():
    outputs_cached = model.generate(**inputs, max_new_tokens=100, use_cache=True)
time_cached = time.time() - start

# WITHOUT cache
start = time.time()
with torch.no_grad():
    outputs_no_cache = model.generate(**inputs, max_new_tokens=100, use_cache=False)
time_no_cache = time.time() - start

print(f"With cache: {time_cached:.2f}s")
print(f"Without cache: {time_no_cache:.2f}s")
print(f"Speedup: {time_no_cache/time_cached:.1f}x")

On a typical GPU, expect 3-5x speedups for this 100-token generation. The gap widens dramatically with longer sequences—at 500 tokens you might see 8-10x, and at 2,000 tokens the difference becomes almost absurd. The reason is the O(T²) vs O(T³) complexity difference: without caching, each additional token makes all previous tokens more expensive to process. With caching, each token costs roughly the same regardless of position.

Production systems achieve 2-24x throughput improvements depending on implementation (2-4x with PagedAttention, up to 15x with LMCache), and specialized deployments see 9.9x latency reductions for time-to-first-token on 100K+ token contexts.

Advanced Usage & Best Practices

Cache Implementation Types

Hugging Face provides several cache implementations:

Cache Type     | Use Case                         | JIT Compatible
DynamicCache   | Default, flexible generation     | No
StaticCache    | Production with torch.compile    | Yes
QuantizedCache | Memory-constrained environments  | No
OffloadedCache | Very long sequences              | No

For production with torch.compile:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer("The future of AI is", return_tensors="pt")

# Compile the forward pass; the static cache's fixed shapes let torch.compile
# avoid recompiling as the sequence grows
model.forward = torch.compile(model.forward, mode="reduce-overhead")
outputs = model.generate(
    **inputs,
    max_new_tokens=30,
    cache_implementation="static",
    do_sample=False
)

When to Clear Cache

# CORRECT: each model.generate() call builds its own KV cache internally,
# so independent prompts never see each other's cached state
for prompt in prompts:
    outputs = model.generate(
        tokenizer(prompt, return_tensors="pt").input_ids,
        use_cache=True,  # Enable KV caching for efficient token generation
        max_new_tokens=50
    )

# CORRECT: when managing the cache manually, start each independent
# session with a fresh cache object instead of reusing the old one
for prompt in prompts:
    cache = DynamicCache()  # Reset: new cache per session
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    outputs = model(input_ids, past_key_values=cache, use_cache=True)

Memory Optimization Techniques

Quantization:

INT8 and FP8 quantization reduce the precision of cached K and V tensors from 16-bit floating point to 8-bit integers or floats. This cuts memory consumption in half with minimal impact on output quality—typically less than 1% degradation on standard benchmarks. The tradeoff is straightforward: you're betting that the attention mechanism doesn't need full precision to make good decisions about which tokens matter. For most production workloads, this bet pays off handsomely. FP8 is generally preferred when your hardware supports it natively (H100 and newer), as it maintains better numerical properties than INT8.
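Recent transformers releases can quantize the cache on the fly through generate(); exact option names, supported backends (the example assumes the quanto backend is installed), and bit widths vary by version, so treat this as a sketch rather than a guaranteed API:

from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer("The future of AI is", return_tensors="pt")

out = model.generate(
    **inputs,
    max_new_tokens=100,
    cache_implementation="quantized",                # store cached K/V at reduced precision
    cache_config={"backend": "quanto", "nbits": 4},  # backend and bit width depend on your install
)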

CPU Offloading:

When GPU memory becomes the bottleneck, offloading KV cache to CPU memory keeps generation possible at the cost of latency. Each token generation requires transferring cached tensors from CPU to GPU, computing attention, and potentially writing updated cache back. On systems with NVLink or CXL interconnects (like NVIDIA Grace-Hopper), this penalty is manageable—900 GB/s bandwidth means a few milliseconds per token. On standard PCIe connections, expect 10-50ms overhead per token depending on cache size. Use this when you need to serve a model that simply won't fit otherwise, but understand you're trading significant latency for the privilege.
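Transformers exposes an offloaded cache along the same lines. Reusing the model and inputs from the quantization sketch above, and again assuming a recent library version, the only change is the cache_implementation argument; the per-token transfer cost described above still applies:

out = model.generate(
    **inputs,
    max_new_tokens=100,
    cache_implementation="offloaded",  # keep cached K/V in CPU RAM, moving layers to GPU as needed
)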

PagedAttention (vLLM):

PagedAttention treats KV cache memory like a virtual memory system—instead of pre-allocating one contiguous block per sequence, it allocates fixed-size pages on demand. Think of it like the difference between reserving an entire hotel floor versus booking individual rooms as needed. Traditional allocation wastes 60-80% of memory on empty space reserved for sequences that might grow longer. PagedAttention reduces this waste to under 4%. The implementation maintains a page table mapping logical token positions to physical memory locations, enabling sequences to grow without expensive memory copies. This is why vLLM achieves such dramatic throughput improvements.
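You don't implement PagedAttention yourself; it's the default allocator in vLLM. A minimal sketch, where the model name and sampling settings are placeholders:

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder; any supported checkpoint works
    gpu_memory_utilization=0.90,               # fraction of GPU memory for weights + KV cache pages
)
params = SamplingParams(max_tokens=100, temperature=0.7)
outputs = llm.generate(["Explain KV cache in one paragraph."], params)
print(outputs[0].outputs[0].text)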

Production Configuration

For high-throughput serving, memory allocation strategy matters enormously. Allocate up to 90% of free GPU memory for KV cache: leaving too much headroom causes underutilization, while going too aggressive triggers OOM errors during traffic spikes. The remaining 10% handles activation memory and framework overhead.

Enable continuous batching to maximize GPU utilization. Without it, a batch of 8 requests where one generates 500 tokens and seven generate 50 tokens will hold all eight slots until the longest finishes. With continuous batching, the seven short requests complete and get replaced while the long one continues, keeping throughput high.

Prefix caching deserves special attention for applications with repeated context. If every request starts with the same 2,000-token system prompt, computing K and V for those tokens once and reusing them across requests saves massive computation. Production systems report 10x cost reduction on cached prefixes.
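In vLLM, prefix caching is a constructor flag. A sketch of the shared-system-prompt pattern described above, with placeholder prompts and model name:

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)  # placeholder model

system_prompt = "You are a support assistant for ExampleCorp. ..."  # hypothetical shared prefix
questions = ["How do I reset my password?", "What is the refund policy?"]

# K/V tensors for the shared prefix are computed once and reused across these requests
outputs = llm.generate(
    [system_prompt + "\n\n" + q for q in questions],
    SamplingParams(max_tokens=100),
)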

Monitor cache hit rates continuously. High eviction frequency—where the system constantly throws away cached data to make room for new requests—indicates you've undersized your cache allocation relative to your traffic pattern. Either add memory, reduce batch sizes, or implement smarter eviction policies that preserve high-value cached content.

Real-World Usage

OpenAI GPT-4

OpenAI plays the caching game aggressively with GPT-4. Cache hits deliver up to 80% latency reduction, and cached content costs 90% less in input token pricing. The cache retention policy is dynamic—frequently-accessed prefixes stick around for hours, while one-off content might evaporate in minutes. This creates interesting optimization opportunities: if your application sends the same system prompt repeatedly, you're essentially getting it for free after the first request. Batch similar requests together in time, and you'll see your costs drop substantially.

Google Gemini

Google implements a three-tier storage architecture that reflects the memory hierarchy realities of large-scale inference:

  1. GPU HBM (fastest, most expensive)
  2. CPU RAM (medium speed, cheaper)
  3. Local SSD (slowest, cheapest)

Hot cache data lives in HBM for instant access. Warm data drops to CPU memory, adding milliseconds of latency but keeping content available. Cold data can persist to SSD for very long-running conversations. Results from Google DeepMind's tiered KV cache system demonstrate: 79% reduction in Time-to-First-Token, 264% increase in input throughput for 100K+ token contexts. The tiered approach lets them serve million-token contexts that would be impossible with GPU-only caching.

Meta LLaMA

LLaMA uses Grouped Query Attention (GQA) architecturally—a training-time decision that fundamentally changes how KV cache scales. Standard multi-head attention gives each query head its own key and value head, meaning cache size scales with total head count. GQA shares key-value heads across groups of query heads, reducing cache to 12.5-25% of standard size depending on group configuration. The tradeoff happens during training: you're betting that shared key-value representations capture enough information for multiple query perspectives. Meta's extensive ablations showed this bet pays off—near-identical quality with dramatically better inference characteristics. Combined with quantization, this architectural choice delivers:

  • 2-4x speedup
  • 41-56% memory reduction
  • Support for context lengths up to 128,000 tokens
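If you want to check the reduction for a specific checkpoint, the relevant numbers are in the model config. The snippet below assumes a LLaMA-style config that exposes num_key_value_heads; the model name is a placeholder and may require access approval:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("meta-llama/Llama-3.1-70B")  # placeholder checkpoint
q_heads = config.num_attention_heads    # e.g. 64 query heads
kv_heads = config.num_key_value_heads   # e.g. 8 shared key/value heads
print(f"KV cache is {kv_heads / q_heads:.1%} the size of full multi-head attention")
# -> 12.5% for an 8-of-64 grouping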

Production Systems

LMCache achieves up to 15x throughput improvement and up to 2x faster token generation latency through cross-query cache sharing and prefill-decode disaggregation, as documented in the LMCache technical report.

KVLink optimizes inference for document-heavy workloads by precomputing KV tensors for documents and concatenating them with positional embedding adjustments, reducing the need to recompute key-value cache for repeated document contexts.

The lesson from these production systems is consistent: the biggest wins come from avoiding redundant computation across requests, not just within a single generation. If two users ask questions about the same document, computing that document's KV cache once and sharing it delivers order-of-magnitude improvements.

Comparison with Alternatives

KV cache isn't an alternative to other optimization techniques—it's the foundation that makes transformer inference practical at all. Understanding how other techniques interact with KV cache helps you build an effective optimization stack.

Technique            | What It Does                              | Memory Impact                         | Performance Improvement                    | Stacks With KV Cache?
KV Cache             | Stores attention K/V tensors              | Scales linearly with sequence length  | 2-10x throughput (vs non-cached)           | N/A (foundational)
Quantization         | Reduces K/V precision to INT8/FP8         | 50-75% cache memory reduction         | 1.5-3x throughput with native support      | Yes—quantize cached tensors
Distillation         | Smaller model with retraining             | 40-60% model size reduction           | 2-4x latency reduction                     | Yes—smaller model = smaller cache
Pruning              | Removes attention heads/weights           | Variable reduction per method         | Variable (hardware-dependent)              | Yes—fewer active heads = smaller cache
Speculative Decoding | Draft model generates tokens in parallel  | Minimal overhead (shared cache)       | 2-3x latency reduction (no accuracy loss)  | Yes—both models share KV cache
MQA/GQA              | Single or grouped K/V heads               | 3-25% of standard cache size          | 1.5-2x inference speed                     | Yes—architectural choice reduces cache

The key insight: KV cache provides the baseline that everything else builds on. Without it, you're starting from such a bad position that other optimizations barely matter. With it, you can stack additional techniques to squeeze out another 2-10x depending on your constraints.

When KV cache alone isn't enough:

  • Memory-constrained: Add quantization
  • Latency-critical: Add speculative decoding
  • Cost optimization: Add quantization + smaller models via distillation

The stack for maximum performance:

  1. GQA architecture (training-time decision)
  2. KV cache with PagedAttention
  3. FP8/INT8 quantization
  4. Continuous batching
  5. Speculative decoding for latency-sensitive applications

Limitations & Considerations

All this sounds great, but there's no free lunch.

Memory Scaling

For a 70B parameter model at 128K context length: approximately 40 GB just for KV cache. This often exceeds available HBM.

The three failure modes:

  1. Out-of-memory (OOM) errors during inference
  2. Increased latency from data transfer overhead (HBM ↔ DRAM ↔ CPU)
  3. Throughput degradation during autoregressive decoding and prefilling

Architectural Solutions

Multi-Query Attention (MQA):

  • Shares single K/V head across all query heads
  • Reduces cache to ~3% of standard size (for 32-head models)
  • Slight accuracy loss (~5-10%) with >11x throughput improvement
  • Training-time decision—can't retrofit

Grouped Query Attention (GQA):

  • Partitions query heads into groups, each group shares key/value heads
  • Reduces cache to 12.5-25% of MHA size (depending on number of groups)
  • Maintains near-MHA quality with minimal accuracy loss
  • Used by LLaMA, Mistral, and other production models

Sliding Window Attention:

  • Only attends to recent tokens (fixed window)
  • Bounded memory regardless of sequence length
  • Requires special handling for global context

Common Pitfalls

Not clearing cache between sessions creates insidious bugs. The cache accumulates state from previous generations, causing the model to condition on irrelevant context. Symptoms include nonsensical outputs, repeated phrases from earlier prompts, or mysterious OOM errors that only appear after running for a while.

Assuming linear cache growth trips up developers working with modern architectures. Models using sliding window attention (Mistral, some LLaMA variants) plateau or even shrink their cache after reaching the window size. If you're budgeting memory based on sequence length × constant, you'll over-allocate for these models.

Using cache during training causes dimension mismatches and incorrect gradients. Training uses teacher forcing where the model sees the entire target sequence simultaneously—there's no sequential generation that would benefit from caching. The use_cache parameter should always be False during training.

Ignoring hardware limits leads to production outages. Memory requirements compound: model weights + KV cache + activation memory + framework overhead must all fit. A model that runs fine on your development machine might OOM in production with different batch sizes or longer contexts.

FAQ

Why doesn't the model recompute Q as well as K and V?

Q (Query) represents "what the current token is looking for." It only matters for the token being generated right now. Previous tokens' queries are irrelevant for the current attention computation—we only need their keys and values.

Can I use KV cache during training?

No. Training uses teacher forcing where the model sees the entire target sequence simultaneously—all positions are processed in parallel, not sequentially. There's no "previous token" in the sense that inference has, because the model attends to the full sequence at once. Caching would add memory overhead without any computational benefit, and the cache update logic would interfere with proper gradient computation.

How do I know if KV cache is actually being used?

Check the cache object after generation:

outputs = model.generate(..., return_dict_in_generate=True)
if outputs.past_key_values:
    print(f"Cache has {len(outputs.past_key_values)} layers")

What happens when the cache gets too large?

Options: quantize the cache, offload to CPU, use PagedAttention for dynamic allocation, or implement sliding window attention to bound cache size.

Does batch size affect cache memory linearly?

Yes. Batch size 8 requires 8× the cache memory of batch size 1 because KV cache memory scales linearly with batch size according to the formula: Memory_KV = 2 × batch_size × num_layers × num_heads × seq_len × head_dim × bytes_per_element. This is why production systems carefully balance batch size against available memory, using strategies like PagedAttention for efficient allocation and continuous batching to maximize GPU utilization.

Can different requests share KV cache?

Yes—this is called prefix caching. According to OpenAI's official documentation, requests with identical prefixes (system prompts, few-shot examples) can reuse cached K/V tensors. Anthropic similarly implements prompt caching with KV matrix reuse for cached content, though the two providers differ in cache retention policies and pricing.

What's the difference between DynamicCache and StaticCache?

DynamicCache provides flexible memory management that grows dynamically with sequence length, while StaticCache pre-allocates fixed-size memory blocks and enables JIT compilation for optimized inference performance. The choice between them involves trade-offs: DynamicCache offers flexibility for variable-length sequences but cannot be compiled with torch.compile, whereas StaticCache requires knowing the maximum sequence length in advance but unlocks compiler optimizations.

How does KV cache work with multi-GPU setups?

Tensor parallelism and pipeline parallelism are the two main strategies for distributing large models across GPUs. Tensor parallelism splits each layer's weights and attention heads across GPUs, so every GPU holds a shard of every layer's KV cache and per-token latency can improve. Pipeline parallelism assigns whole layers to different GPUs, so each GPU holds the full cache for its own layers and throughput is the main win. Both techniques require careful memory planning to balance cache distribution, GPU utilization, and communication overhead across the cluster.

Making KV Cache Work for You

KV cache transforms transformer inference from "theoretically possible" to "production-ready." Without it, generating long sequences would be computationally prohibitive.

There's something elegant about KV cache—it embodies a fundamental truth of computation: remembering is almost always cheaper than recomputing. The trick is knowing what to remember, how long to keep it, and when to let it go.

Key takeaways:

  1. KV cache is the baseline—enable use_cache=True for any generation task
  2. Memory scales with batch_size × layers × heads × seq_length × head_dim
  3. Clear cache between independent sessions
  4. For production: combine with PagedAttention, quantization, and continuous batching
  5. Monitor cache hit rates and memory utilization

Next steps:

If you're deploying models, start with Hugging Face's DynamicCache and measure your memory consumption. For long sequences or high throughput requirements, implement PagedAttention (achieves 2-4x GPU utilization improvement) through vLLM to enable non-contiguous memory allocation. If you're hitting memory limits, combine PagedAttention with quantization (FP8 cuts cache memory roughly in half; 4-bit formats such as NVFP4 push the reduction toward 75% for long-context scenarios), or consider Grouped Query Attention (GQA) if retraining is feasible, which shrinks the cache to 12.5-25% of its multi-head size. For production deployment at scale, TensorRT-LLM provides CUDA kernel optimizations alongside PagedAttention, with priority-based cache management for approximately 20% improvement in cache hit rates.

If you're building models, consider GQA architecture during training—it's a permanent decision that pays dividends in deployment.

This guide focuses specifically on KV cache optimization for transformer inference. KV cache is a foundational technique used in production LLM systems, reducing autoregressive generation complexity from O(T³) to O(T²) and enabling substantial throughput improvements when combined with other optimizations like quantization and continuous batching. Understanding KV cache moves you from "I can run a model" to "I can deploy a model efficiently."

That's a meaningful difference—the difference between a demo and a product. Now go make something fast.