January 16, 2026

How To Optimize LLM Inference in Production in 2026

Sergey Kaplich

TL;DR

LLM inference optimization in production can deliver 5-24x throughput improvements and 30-50% cost reductions through proven techniques. Start with 4-bit quantization (<1% accuracy loss, 4x memory reduction), deploy with vLLM for automatic continuous batching and PagedAttention (40-50% KV cache savings), and measure obsessively with metrics like TTFT, TPS, and P99 latency. The production inference problem—slow response times, high GPU costs, memory bottlenecks—is solved through combined optimization: quantization compresses models, continuous batching maximizes GPU utilization, and KV cache optimization eliminates memory fragmentation. Teams routinely achieve sub-500ms TTFT while cutting infrastructure costs in half by treating optimization as integral to production rather than an afterthought.


Your model works beautifully in development. Then production hits.

Three-second response times. GPU bills that make your CFO weep. Users abandoning your app mid-generation.

The production inference problem.

The gap between development and production has never been wider. A single H100 GPU costs $1.45-$7.57 per hour on cloud platforms. Memory bandwidth, not compute power, bottlenecks most deployments. And that KV cache you've never thought about? It's quietly eating 40GB of GPU memory for a 16K context length, consuming 60-80% of VRAM in production scenarios.

The good news: optimization works. Dramatically.

Cloudflare's optimization delivered a 5.18x throughput increase. vLLM shows 24x throughput gains over naive implementations under high concurrency. Teams routinely cut costs by 50% while improving latency through combined quantization, batching, and KV cache optimization.

What is LLM Inference Optimization?

Training teaches a model. Inference uses it.

Every time a user sends a prompt and receives a response, that's inference. It differs fundamentally from training.

LLM inference happens in two phases:

  • Prefill phase: The model processes your entire input prompt in parallel. This is compute-bound. Your GPU is actually doing math.
  • Decode phase: The model generates output tokens one at a time. Each new token requires reading the full model weights from memory. This is memory-bound. Your GPU spends most of its time waiting for data transfer, not calculating.

Databricks' analysis shows the decode phase achieves less than 50% GPU utilization despite available compute capacity. The GPU sits idle, waiting for memory.
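A quick back-of-envelope calculation makes the bottleneck concrete. This is a sketch with illustrative assumptions (model size, ~3.3 TB/s of HBM bandwidth), not a benchmark:

# Back-of-envelope: decode is memory-bound because generating each token
# re-reads the full weight set from GPU memory. Rough ceiling for batch size 1:
#   tokens/sec <= memory_bandwidth / bytes_read_per_token

def decode_ceiling_tokens_per_sec(params_billions, bytes_per_param, bandwidth_tb_s):
    weight_bytes = params_billions * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / weight_bytes

# 70B model in FP16 on a ~3.3 TB/s GPU: roughly 24 tokens/sec per sequence
print(decode_ceiling_tokens_per_sec(70, 2, 3.3))    # ~23.6
# The same model in 4-bit roughly quadruples that ceiling
print(decode_ceiling_tokens_per_sec(70, 0.5, 3.3))  # ~94.3

Batching breaks this per-sequence ceiling: the same weight read serves every sequence in the batch.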

Optimization means attacking these bottlenecks: reducing memory transfers, compressing model weights, managing caches efficiently, and batching requests intelligently.

The goal is simple: faster responses, lower costs, higher throughput.

Why Optimization Matters in Production

Development and production are different universes.

In development, you care about one request at a time. In production, you handle hundreds or thousands concurrently.

In development, a 2-second response is fine. In production, users bounce after 500 milliseconds.

In development, your GPU bill is background noise. In production, it's a line item that needs justification.

The cost reality is brutal. GPU pricing ranges from $1.45/hour to $6.98/hour for H100 instances. At 24/7 operation, that's $1,050 to $5,000 per GPU per month. Most production deployments need multiple GPUs.

Memory is the primary constraint. A 70B parameter model in FP16 precision requires 140GB of GPU memory, far exceeding any single consumer GPU. Even with the largest datacenter GPUs, you're immediately forced into multi-GPU configurations or compression techniques.
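A rough sizing and cost sketch using the figures above (illustrative only; real deployments also need headroom for the KV cache and activations):

# Weight memory by precision, plus monthly GPU cost at the quoted hourly rates.

def weight_memory_gb(params_billions, bytes_per_param):
    return params_billions * bytes_per_param  # 1e9 params cancels 1e9 bytes/GB

for precision, bytes_per_param in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"70B @ {precision}: {weight_memory_gb(70, bytes_per_param):.0f} GB")
# FP16: 140 GB, INT8: 70 GB, INT4: 35 GB -- only 4-bit fits a single 80GB GPU

hours_per_month = 24 * 30
for rate in (1.45, 6.98):
    print(f"${rate}/hr -> ${rate * hours_per_month:,.0f} per GPU per month")
# ~$1,044 to ~$5,026 per GPU per month, before multi-GPU replication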

Without optimization, you're throwing money at hardware that sits partially idle while users wait for responses. That's not a sustainable business model.

Key Optimization Techniques

Quantization

Quantization reduces numerical precision: storing weights in fewer bits.

The numbers are compelling. 4-bit quantization on Llama 3.1 models shows less than 1% average accuracy drop across diverse benchmarks (Red Hat evaluated over 500,000 quantized LLM tests). You get 4x memory reduction for negligible quality loss.

The precision hierarchy:

  • FP32: Full precision, baseline (4 bytes per parameter)
  • FP16/BF16: Half precision, 2x memory reduction vs FP32, effectively lossless for inference
  • FP8: Quarter precision, ~2x reduction vs FP16, near-lossless with proper scaling
  • INT8: 2x reduction vs FP16, minimal accuracy impact
  • INT4: 4x reduction vs FP16, <1% accuracy loss with proper techniques (GPTQ, AWQ)

AWQ (Activation-Aware Weight Quantization) identifies and protects high-importance weights based on activation patterns. It achieves near-FP16 accuracy with 4-bit weights while maintaining FP16 activations.

GGUF is the format of choice for llama.cpp, supporting everything from 2-bit to 8-bit quantization. Q4_K is the recommended option, offering 4x compression with balanced quality.

Here's how to do it with vLLM:

from vllm import LLM

llm = LLM(
    model="your-model",
    quantization="awq",  # or "gptq"
    dtype="bfloat16",
    gpu_memory_utilization=0.9,
    max_num_batched_tokens=8192,
    enable_chunked_prefill=True
)

Start with 4-bit quantization as your foundation optimization. Before implementing, establish baseline metrics: measure Time to First Token (TTFT), Tokens Per Second (TPS), and end-to-end latency using your production workload. For low-risk applications (content generation), acceptable degradation is 3-5%. For high-risk applications (medical, legal, financial), keep degradation below 1%.

Model Distillation

Model distillation trains a smaller "student" model to mimic a larger "teacher" model through knowledge transfer. Unlike quantization, distillation requires retraining with a combined loss objective. This enables 40-60% parameter reduction while keeping accuracy degradation to roughly 3-5% when properly implemented.
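A minimal sketch of that combined objective in PyTorch (temperature and weighting are placeholder hyperparameters): the student is trained against both the teacher's softened output distribution and the ground-truth labels.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: match the teacher's temperature-softened distribution
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy against ground-truth next tokens
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard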

The canonical example: DistilBERT uses 40% fewer parameters than BERT while retaining 97% of performance and running 60% faster. TinyLlama, with 1.1B parameters, outperforms comparably sized open models like OPT-1.3B and Pythia-1.4B on downstream tasks.

When to use distillation:

  • You need architectural changes (not just compression)
  • You're deploying to severely constrained environments (mobile, edge)
  • You can tolerate 3-5% accuracy loss
  • You have training resources and domain-specific data

When to skip it:

  • Quantization's <1% degradation meets your needs
  • You lack GPU compute for retraining
  • Your inference latency already meets production SLAs

Pruning

Pruning removes unnecessary model weights. Research shows 30% pruning on LLaMA-2-70B causes only 0.8% accuracy drop. At 50% pruning, that rises to 3.8%.

Structured pruning removes entire architectural units: neurons, attention heads, whole layers. The resulting model runs efficiently on standard hardware with no special support required.

Unstructured pruning zeros individual weights, creating sparse matrices. Higher compression is possible, but standard GPUs are optimized for dense matrix operations, so irregular sparsity patterns deliver little real-world speedup without dedicated hardware or kernel support.

Structured pruning produces physically smaller models with regular computational patterns compatible with standard GPU acceleration, achieving 2-4x compression with ~2.7x inference speedups.

NVIDIA's Minitron combines structured pruning with distillation, achieving 2-4x compression while requiring up to 40x fewer training tokens than training from scratch.
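To make the structured/unstructured distinction concrete, here is a toy illustration on a single linear layer using PyTorch's pruning utilities; it is a sketch, not an LLM pruning pipeline.

import torch.nn as nn
import torch.nn.utils.prune as prune

unstructured = nn.Linear(4096, 4096)
structured = nn.Linear(4096, 4096)

# Unstructured: zero the 30% smallest-magnitude weights. The matrix keeps its
# shape; speedups require sparse kernels or hardware support.
prune.l1_unstructured(unstructured, name="weight", amount=0.3)

# Structured: mask 30% of output neurons (entire rows, L2 criterion). Masked
# rows can then be physically removed, leaving a smaller dense layer that
# standard GPUs accelerate directly.
prune.ln_structured(structured, name="weight", amount=0.3, n=2, dim=0)

prune.remove(unstructured, "weight")  # bake the masks into the weights
prune.remove(structured, "weight")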

Caching & KV Cache Optimization

During generation, every transformer layer needs the key and value vectors for all previously processed tokens. The KV cache stores these vectors so they aren't recomputed, reducing per-token attention cost from O(t²) to O(t) during the decode phase. The trade-off: memory consumption that scales with batch size, sequence length, and model depth.
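To see how fast that adds up, here's a rough sizing sketch; the 70B-class configuration (80 layers, 128-dimensional heads, FP16 cache) is an illustrative assumption.

def kv_cache_gb(layers, kv_heads, head_dim, seq_len, batch, bytes_per_value=2):
    # 2x for keys and values, stored per layer for every token in the sequence
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value / 1e9

# 70B-class model with full multi-head attention (64 KV heads):
print(kv_cache_gb(layers=80, kv_heads=64, head_dim=128, seq_len=16_384, batch=1))  # ~42.9 GB

# The same model with grouped-query attention (8 KV heads, covered below): ~5.4 GB
print(kv_cache_gb(layers=80, kv_heads=8, head_dim=128, seq_len=16_384, batch=1))

That ~43GB for a single 16K-token sequence is where the 40GB figure quoted earlier comes from, and it's why fragmentation in this memory is so expensive.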

PagedAttention solved the fragmentation problem. Developed for vLLM, it applies OS-style virtual memory to KV cache management:

  • Cache divided into fixed-size blocks (typically 16 tokens each)
  • Blocks allocated non-contiguously as needed
  • Memory sharing for parallel sampling and beam search
  • Result: less than 4% memory waste versus 60-80% with traditional approaches

The benchmarks are striking. vLLM achieves up to 24x higher throughput than standard HuggingFace Transformers serving under high-concurrency scenarios, with near-flat latency scaling to 64,000 tokens.

Grouped-Query Attention (GQA) reduces cache size by sharing key/value heads across query head groups. It delivers up to 5.4x speedup over standard multi-head attention while maintaining near-baseline quality. The recommended default for new deployments.

Batching & Parallelization

Batching combines multiple requests into single GPU operations.

Static batching waits for N requests to arrive, then processes them together. Simple but inefficient: fast requests wait for slow ones.

Dynamic batching adjusts batch size based on arrival rate. However, it blocks on the slowest request, creating head-of-line blocking.

Continuous batching operates at the token level. When any sequence completes, its slot immediately goes to the next queued request. Continuous batching delivers up to 24x throughput improvement versus static batching under high concurrency.

vLLM and TGI implement continuous batching automatically.

For multi-GPU deployments:

Tensor parallelism splits individual layers across GPUs when a model can't fit on a single device.

Pipeline parallelism places different layers on different GPUs. Better for throughput when you can tolerate slightly higher latency.

llm = LLM( model="large-model", tensor_parallel_size=2, # Split across 2 GPUs )

Hardware Acceleration

Choosing the right hardware can eliminate entire classes of optimization complexity. Here's what matters in production:

NVIDIA H100 remains the production standard, with 80GB of HBM3 and the most mature software ecosystem.

AMD MI300X offers superior price-performance:

  • 192GB HBM3 (2.4x more than H100)
  • 5.3TB/s bandwidth
  • Starting at $1.99/hour

Character.ai achieved 2x inference performance improvement because the MI300X's memory efficiently handles large models with extensive conversation history.

AWS Inferentia2 provides purpose-built cost optimization:

  • Starting at $0.76/hour
  • Up to 384GB shared accelerator memory
  • Trade-off: vendor lock-in to AWS Neuron SDK

Apple Silicon enables edge deployment via llama.cpp and its unified memory architecture.

The hardware choice often dictates which optimizations you need. More memory means less aggressive quantization. Better bandwidth means less caching pressure.

Tools & Frameworks for Optimization

vLLM leads for high-throughput GPU serving with PagedAttention, continuous batching, up to 24x throughput over unoptimized Transformers serving under high concurrency, and the broadest hardware support. It provides an OpenAI-compatible API server out of the box, making migration from OpenAI straightforward. Production deployments benefit from built-in support for tensor parallelism, speculative decoding, and automatic memory management.
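To illustrate that compatibility claim, a minimal sketch assuming a server already started with `vllm serve your-model` and listening on the default port (model name and URL are placeholders):

from openai import OpenAI

# Point the standard OpenAI client at the vLLM server instead of api.openai.com
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="your-model",
    messages=[{"role": "user", "content": "Summarize continuous batching in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)

The only change from calling OpenAI directly is the base_url.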

TensorRT-LLM maximizes NVIDIA hardware with custom CUDA kernels and 8x speedups. NVIDIA-only, and requires more setup than vLLM, but the performance gains justify the investment for high-throughput NVIDIA deployments where every millisecond counts.

HuggingFace TGI excels in the HuggingFace ecosystem with seamless Hub model loading and production observability built-in.

Triton Inference Server orchestrates multi-framework deployments with TensorRT, PyTorch, ONNX, TensorFlow backends.

ONNX Runtime provides cross-platform portability with the broadest hardware support.

llama.cpp handles CPU inference and edge deployments with minimal dependencies. Supports the GGUF format with quantization options ranging from 2-bit to 8-bit, making it ideal for running models on consumer hardware and Apple Silicon. The Q4_K quantization option provides excellent balance between compression and quality.
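A minimal sketch using the llama-cpp-python bindings, assuming you already have a Q4_K-quantized GGUF file (the path and parameters are placeholders):

from llama_cpp import Llama  # llama-cpp-python bindings over llama.cpp

llm = Llama(
    model_path="./model-Q4_K_M.gguf",  # placeholder path to a GGUF file
    n_ctx=4096,    # context window
    n_threads=8,   # CPU threads
)

output = llm("Explain the KV cache in one sentence.", max_tokens=64)
print(output["choices"][0]["text"])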

Start with vLLM for most deployments. Move to TensorRT-LLM when maximum NVIDIA GPU performance is critical. Use llama.cpp for CPU inference and edge deployments. Choose ONNX Runtime for cross-platform requirements.

Getting Started with Optimization

Week 1: Establish baselines

Measure before you optimize. Critical metrics: Time to First Token (TTFT) targeting <500ms for interactive applications, Tokens per Second (TPS), P50/P95/P99 latency, and GPU utilization targeting 80%+.

vllm bench --model your-model \
    --input-len 128 \
    --output-len 256 \
    --num-prompts 1000
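Complementing the offline benchmark, here's a rough sketch for measuring TTFT and decode TPS against a running OpenAI-compatible endpoint (the server address and model name are assumptions; each streamed chunk is treated as roughly one token):

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_time = None
tokens = 0

stream = client.completions.create(
    model="your-model",
    prompt="Write a short product description for a mechanical keyboard.",
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if first_token_time is None:
        first_token_time = time.perf_counter()
    tokens += 1  # approximation: one token per streamed chunk

total = time.perf_counter() - start
ttft = first_token_time - start
print(f"TTFT: {ttft * 1000:.0f} ms")
print(f"Decode TPS: {tokens / (total - ttft):.1f} tokens/sec")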

Week 2-4: Quick wins and production integration

Implement quantization first. Highest ROI, lowest complexity. vLLM handles continuous batching and PagedAttention automatically.

from fastapi import FastAPI
from vllm import LLM, SamplingParams

app = FastAPI()

llm = LLM(
    model="your-model-name",
    quantization="awq",
    gpu_memory_utilization=0.9,
    max_num_batched_tokens=8192,
    enable_chunked_prefill=True
)

@app.post("/generate")
async def generate(prompt: str, max_tokens: int = 256):
    outputs = llm.generate([prompt], SamplingParams(max_tokens=max_tokens))
    return {"text": outputs[0].outputs[0].text}

Week 4+: Advanced optimization

Consider model-specific tuning, multi-GPU parallelism, speculative decoding, and custom configurations based on your workload patterns.

Real-World Implementation Examples

Cloudflare Workers AI

Cloudflare Workers AI provides one of the most comprehensive documented optimization journeys. They implemented KV cache compression with PagedAttention, speculative decoding, and global GPU distribution across 150+ cities.

The results: 5.18x throughput increase, 50% price reduction for inference, and sub-second latency globally. Their speculative decoding implementation alone delivered 2-4x inference speedups.

The key insight from Cloudflare's work: stacking optimizations compounds. Each technique amplifies the others.

Character.ai

Character.ai faced a specific challenge: serving conversational AI with extensive conversation histories that could span thousands of tokens. Standard GPU memory became the bottleneck.

By deploying AMD MI300X GPUs with their 192GB HBM3 (2.4x more memory than H100), they achieved 2x inference performance improvement. The larger memory pool eliminated the need for aggressive context truncation and allowed full conversation histories to remain in the KV cache.

This case demonstrates why hardware selection matters. Sometimes the right hardware choice eliminates optimization complexity.

Shopify's Multimodal LLM Deployment

Shopify's deployment for product classification runs 40 million inferences daily. Their stack combines FP8 quantization for memory efficiency, in-flight batching for throughput, and KV cache systems distributed across Kubernetes clusters.

The multi-task fine-tuning approach lets them serve product classification, content moderation, and search ranking from consolidated infrastructure rather than separate model deployments.

vLLM Production Deployments

vLLM deployments documented in UC Berkeley research show consistent patterns across diverse workloads:

  • 30-40% p50 latency reduction for median user experience
  • 20-30% p99 latency reduction for tail latency (critical for SLAs)
  • 1.5-2x throughput improvements with identical hardware
  • 40-50% KV cache memory savings through PagedAttention
  • 15-25% GPU utilization improvement from continuous batching

These improvements came from organizations serving real traffic, not synthetic benchmarks. The latency reductions proved especially valuable for interactive applications where p99 latency directly impacts user perception.

Netflix's Foundation Model

Netflix's approach demonstrates optimization at streaming scale. Rather than running multiple specialized recommendation models, they consolidated into a single foundation model serving personalized recommendations to hundreds of millions of users.

The result: millisecond-level latency with reduced maintenance overhead. While specific optimization techniques aren't public, the architecture choice itself (consolidation over fragmentation) represents a production optimization pattern.

The Pattern

The pattern across all these deployments: combine quantization, PagedAttention, and continuous batching as the baseline. These three techniques stack to deliver 24x higher throughput under high concurrency and 30-50% cost reductions. Then optimize further based on your specific constraints: memory (MI300X), latency (speculative decoding), or global distribution (edge deployment).

Trade-offs & Considerations

Accuracy vs. speed: 4-bit quantization causes <1% accuracy loss. Aggressive pruning (50%) causes 3.8%. Distillation causes 5-15% for complex reasoning tasks.

Cost vs. performance: H100 delivers maximum performance at premium pricing. MI300X offers 27-74% cost savings with superior memory. Inferentia2 minimizes cost with vendor lock-in.

Monitoring requirements: Track latency percentiles (not just averages), token consumption, quality metrics, and drift detection. Specialized tools like Datadog LLM Observability or Arize AI provide LLM-specific quality metrics.

Comparing Optimization Strategies

Technique | Accuracy Impact | Memory Savings | Speed Gain | Complexity | Start Here?
4-bit Quantization | <1% | 4x | 2x | Low | Yes
Continuous Batching | 0% | 0% | 8-24x | None (automatic) | Yes
PagedAttention | 0% | 40-50% | Variable | None (automatic) | Yes
Structured Pruning | 0.8-3.8% | 1.4-2x | 1.5-2.7x | Medium | After quantization
Distillation | 5-15% | 2x | 2-3x | High | Specific requirements
Speculative Decoding | 0% | 0% | 60-111% | Medium | After basics

The recommended sequence:

  1. Deploy with vLLM (automatic continuous batching, PagedAttention)
  2. Apply 4-bit quantization (AWQ or GPTQ)
  3. Measure and benchmark
  4. Add speculative decoding if latency-critical
  5. Consider pruning or distillation for extreme compression needs

FAQ

How much accuracy will I lose from optimization?

4-bit quantization: <1%. Pruning at 30%: ~0.8%. Always measure on your specific tasks.

What's the minimum hardware for production LLM inference?

7B models run efficiently on consumer GPUs (24GB VRAM) even in FP16. 70B models in 4-bit require ~35GB for weights, which fits on a single 48-80GB datacenter GPU once you add KV cache headroom.

How do I measure success?

Track TTFT, tokens per second, P99 latency, and cost per request. Latency percentiles matter more than averages.

How much will optimization save on costs?

30-40% infrastructure cost reduction is typical. Cloudflare achieved 50%. The combination of quantization, efficient batching, and KV cache optimization compounds significantly.

What about CPU inference?

llama.cpp handles CPU and Apple Silicon effectively. For Apple Silicon M2 Ultra, expect 60-128 tokens/sec for 7B-14B models. Viable for edge deployment, not competitive with GPU throughput for high-scale serving.

Should I use vLLM or TensorRT-LLM?

Start with vLLM for most deployments. It offers broad hardware support (NVIDIA, AMD, Intel), an OpenAI-compatible API, and excellent documentation. Move to TensorRT-LLM when you're NVIDIA-only and need maximum performance; it can achieve 8x speedups with custom CUDA kernels. The setup complexity is higher, but justified for latency-critical, high-throughput NVIDIA deployments.

How does speculative decoding work and when should I use it?

Speculative decoding uses a smaller "draft" model to generate multiple tokens quickly, then the larger target model verifies them in a single batch. If the draft model guessed correctly, you've generated multiple tokens for the cost of one verification step. Research shows 60-111% improvements. Use it after implementing basics (quantization, continuous batching) for latency-critical applications where TTFT and token generation speed directly impact user experience.
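Conceptually, a greedy-decoding sketch of the draft-and-verify loop. `draft_next` and `target_logits` are hypothetical stand-ins for the draft and target models; production engines like vLLM and TensorRT-LLM implement this internally and far more efficiently.

def speculative_step(tokens, draft_next, target_logits, k=4):
    # 1. The small draft model proposes k tokens cheaply, one at a time.
    proposal, ctx = [], list(tokens)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)

    # 2. The large target model scores all k positions in ONE forward pass,
    #    returning k+1 next-token distributions.
    logits = target_logits(tokens, proposal)

    # 3. Accept drafted tokens while the target agrees (greedy acceptance);
    #    on the first disagreement, keep the target's own token and stop.
    accepted = []
    for i, t in enumerate(proposal):
        best = max(range(len(logits[i])), key=lambda v: logits[i][v])
        if best == t:
            accepted.append(t)
        else:
            accepted.append(best)
            break
    else:
        # Every draft token accepted: one extra "free" token from the final distribution.
        accepted.append(max(range(len(logits[k])), key=lambda v: logits[k][v]))

    return tokens + accepted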

What's the difference between structured and unstructured pruning?

Structured pruning removes entire components (neurons, attention heads, layers), producing physically smaller models that run on standard GPUs. Unstructured pruning zeros individual weights, creating sparse matrices that require specialized hardware support to realize benefits. For practical deployments, structured pruning is the choice: 2-4x compression with ~2.7x speedups on standard hardware.

How do I choose the right quantization format (AWQ vs GPTQ vs GGUF)?

Use AWQ for vLLM deployments. It's activation-aware and maintains near-FP16 accuracy with 4-bit weights. Use GGUF for llama.cpp and edge deployments; it's the native format with flexible quantization options (Q4_K recommended). GPTQ is an alternative to AWQ with similar performance; choose based on which has better support for your specific model.

Can I serve multiple models on the same GPU?

Yes, but memory management becomes critical. vLLM supports serving multiple models through its scheduler, and you can configure gpu_memory_utilization to leave headroom. The practical approach: quantize aggressively (4-bit) to fit more models, use continuous batching to maximize utilization, and monitor memory pressure carefully. For production multi-model serving, consider dedicated model routing with separate GPU pools for critical models.

Making Optimization Work for Your Use Case

LLM inference optimization isn't a one-time project. It's an ongoing practice.

Start with the fundamentals: vLLM, 4-bit quantization, proper benchmarking. These deliver most gains with minimal complexity. Measure obsessively. You can't optimize what you don't measure.

Then iterate. Your workload patterns will reveal specific bottlenecks.

Long contexts? Focus on KV cache optimization.

Latency-critical? Explore speculative decoding.

Cost-constrained? Consider hardware alternatives.

The research is clear. Optimization works. Measurably. Teams that treat optimization as integral to production, not an afterthought, ship better products at lower cost.

Your users don't see your GPU utilization metrics. They see response times. They see availability. They see whether your application works.

Make it work. Make it fast. Make it sustainable.