Glossary: AI
November 7, 2025

Semantic Chunking

TL;DR

You split your 50-page technical manual at exactly 512 tokens. Halfway through explaining database normalization, your chunker slices the concept in half. Your RAG system retrieves the fragment. Your AI hallucinates the rest.

Semantic chunking promises to fix this by measuring how similar consecutive sentences are using embeddings, splitting only where meaning shifts. But here's what the 2024-2025 benchmarks actually show: it costs more, takes longer to implement, and doesn't consistently outperform simple fixed-size chunking with overlap. The performance gains are minimal, inconsistent, and often shrink significantly with quality embeddings.

Start with fixed-size. Add overlap. Measure your results. Upgrade to semantic only if your metrics prove you need it.

What you need to know

Fixed-size chunking takes your document and chops it at a fixed count, typically 400-600 tokens. Mid-sentence? Mid-paragraph? Right in the middle of explaining why quantum computers aren't actually magic? Doesn't matter. The knife falls where the count says.
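
The whole method fits in a few lines. A minimal sketch, assuming tiktoken for counting tokens (the 512/50 numbers are illustrative, not recommendations):

```python
# Minimal fixed-size chunking: split purely on token count, with overlap.
# Assumes tiktoken is installed; 512/50 are illustrative values.
import tiktoken

def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(enc.decode(window))  # the cut falls wherever the count says
    return chunks
```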

Semantic chunking works differently. It reads each sentence, converts it to an embedding (a mathematical representation of meaning), then measures similarity between consecutive sentences. When that similarity drops significantly—say, when you shift from discussing database design to suddenly talking about user authentication—it places a boundary there.

The appeal is obvious. Instead of arbitrary boundaries that slice through important explanations, you get semantically coherent pieces that preserve complete thoughts and group related concepts together. Each chunk makes sense on its own.

Then researchers at Vectara and the University of Wisconsin-Madison put it to the test.

They compared semantic chunking against fixed-size across document retrieval, evidence retrieval, and answer generation tasks. The findings challenge everything the marketing materials promise.

In realistic scenarios with typical document structures, fixed-size chunking matched or exceeded semantic chunking's performance. The computational overhead—generating embeddings for every sentence, calculating similarity scores, running threshold algorithms—delivered negligible gains.

When researchers tested on artificially stitched documents containing drastically different topics, semantic chunking showed advantages. In production scenarios with normal text? The benefits vanished.

You're generating embeddings for every sentence, computing similarity matrices, running threshold detection. This pipeline is still ~2-5× slower for most implementations, though newer clustering variants narrow the gap. For what gain? Modern language models handle imperfect chunk boundaries remarkably well—especially extended-context GPT-4-class and Claude-class models. They can identify relevant content within slightly messy chunks. The sophisticated preprocessing doesn't buy you what you'd expect.

And embedding models keep improving. Recent advances in context-aware retrieval mean models better interpret chunks with surrounding context windows, further reducing the advantage of perfect semantic boundaries. The gap between sophisticated chunking and simple fixed-size keeps shrinking.

The pattern rarely changes—simple math, expensive compute, tidy cuts. You split text into sentences, optionally group them (buffer size of 1-3 sentences), generate embeddings for each group, calculate cosine similarity between consecutive groups, then create boundaries where similarity drops. LangChain offers four threshold detection methods—percentile (95th by default), standard deviation, interquartile range, and gradient. LlamaIndex uses only percentile thresholding. The percentile approach calculates all similarity drops, then splits wherever the drop exceeds your chosen threshold.
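
A rough sketch of that pipeline from scratch (not LangChain's or LlamaIndex's implementation), assuming a sentence-transformers model and the percentile method:

```python
# Sketch of percentile-threshold semantic chunking: embed sentences,
# measure cosine similarity between neighbors, split where the drop
# lands in the top few percent. Assumes sentence-transformers.
import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_chunks(sentences: list[str], percentile: float = 95.0) -> list[str]:
    if len(sentences) < 2:
        return [" ".join(sentences)]
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(sentences, normalize_embeddings=True)
    sims = np.sum(emb[:-1] * emb[1:], axis=1)   # cosine similarity of neighbors
    drops = 1.0 - sims                          # bigger drop = bigger topic shift
    cutoff = np.percentile(drops, percentile)   # top ~5% of drops become boundaries
    chunks, current = [], [sentences[0]]
    for sentence, drop in zip(sentences[1:], drops):
        if drop >= cutoff:
            chunks.append(" ".join(current))    # boundary: meaning shifted
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```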

Newer variants exist. Max–Min semantic chunking uses clustering algorithms across sentence embeddings to form more coherent groups, showing improved clustering metrics in controlled tests (higher intra-chunk similarity, lower inter-chunk similarity). But improved coherence scores don't necessarily translate to better production RAG performance—the fundamental cost/benefit question remains.

Even as new clustering tricks emerge, real-world evaluations keep humbling them.

NVIDIA's 2024 benchmark tested seven chunking strategies across five datasets. Page-level chunking won with retrieval F1≈0.648 (lowest variance across datasets). Financial reports, legal documents, and research papers organize information by pages. Respecting existing structure helps retrieval find the right context. Semantic chunking ranked behind this simpler structural approach.
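
Page-level chunking, by contrast, needs no math at all. A minimal sketch, assuming pypdf and a placeholder file path:

```python
# Page-level chunking: one chunk per page, respecting the document's own structure.
# Assumes pypdf; "report.pdf" is a placeholder path.
from pypdf import PdfReader

reader = PdfReader("report.pdf")
page_chunks = [page.extract_text() for page in reader.pages]
```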

Chunk size matters more than you'd think. The common wisdom says 256-512 tokens works for most cases, but 2025 research reveals that optimal size depends on your query type. Factoid queries—"What is the capital of France?"—perform best with smaller chunks (64-128 tokens). Analytical queries requiring broader context need larger chunks (512-1024 tokens). Method matters; granularity matters more.

When semantic chunking actually delivers value:

Complex documents with frequent topic shifts. Technical manuals mixing different concepts where preserving complete arguments matters. Research papers where fragmenting a proof or explanation breaks understanding. Medical, legal, or financial content where context preservation is critical and computational cost is justified by measurably better retrieval quality for your specific use case.

Measurably. That word matters. Don't assume semantic chunking helps. Prove it with your own data, your own queries, your own retrieval metrics.

Fixed-size chunking works well for homogeneous content like news articles or blog posts. High-volume systems where processing speed matters. Rapid prototyping when you need baseline performance quickly. When you're starting out and don't yet know what good looks like.

The smart approach? Start with recursive character splitting at ~400 tokens with 10-20% overlap. Measure your retrieval accuracy, answer quality, and user satisfaction. Run actual queries. Check actual results. Upgrade to semantic chunking only if your metrics prove the additional complexity delivers value. Industry testing shows semantic chunking's overhead doesn't consistently justify performance gains across all use cases.
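
As a concrete starting point, that baseline might look like this with LangChain's RecursiveCharacterTextSplitter (the import path and the placeholder file name are assumptions; adjust sizes to your content):

```python
# Baseline splitter: recursive character splitting measured in tokens, with overlap.
# Assumes the langchain-text-splitters package; "manual.txt" is a placeholder.
from langchain_text_splitters import RecursiveCharacterTextSplitter

document_text = open("manual.txt", encoding="utf-8").read()

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=400,     # ~400 tokens per chunk
    chunk_overlap=60,   # ~15% overlap
)
chunks = splitter.split_text(document_text)
```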

LangChain's SemanticChunker offers four threshold types with custom configuration. You can set breakpoint_threshold_type to percentile, standard_deviation, interquartile, or gradient, then tune the breakpoint_threshold_amount to control sensitivity. LlamaIndex provides SemanticSplitterNodeParser with percentile thresholding and configurable breakpoint_percentile_threshold. Both can use OpenAI embeddings, Cohere models, or open-source alternatives like all-MiniLM-L6-v2.
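
Roughly how that configuration looks in code; import paths shift between versions, so treat this as a sketch rather than copy-paste gospel:

```python
# LangChain: SemanticChunker with an explicit threshold type and sensitivity.
# Assumes langchain-experimental and langchain-openai; "manual.txt" is a placeholder.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

document_text = open("manual.txt", encoding="utf-8").read()

chunker = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",   # or "standard_deviation", "interquartile", "gradient"
    breakpoint_threshold_amount=95,           # higher = fewer, larger chunks
)
docs = chunker.create_documents([document_text])

# LlamaIndex: the percentile-only equivalent.
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

parser = SemanticSplitterNodeParser(
    embed_model=OpenAIEmbedding(),
    buffer_size=1,                       # sentences grouped before embedding
    breakpoint_percentile_threshold=95,
)
```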

But there's a catch. Neither framework natively limits chunk size. A semantic boundary might create a 5000-token chunk that exceeds your model's context window. You'll need additional logic to split oversized chunks while preserving semantic boundaries where possible. This adds complexity.
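
One way to patch that gap: check each semantic chunk against a token budget and re-split only the offenders with a fixed-size fallback. A sketch, with the 512-token cap as an illustrative assumption:

```python
# Post-process semantic chunks: anything over the token budget gets re-split
# with a fixed-size fallback so it still fits the context window.
# Assumes tiktoken and langchain-text-splitters; the 512 cap is illustrative.
import tiktoken
from langchain_text_splitters import RecursiveCharacterTextSplitter

enc = tiktoken.get_encoding("cl100k_base")
fallback = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", chunk_size=512, chunk_overlap=50
)

def cap_chunk_size(chunks: list[str], max_tokens: int = 512) -> list[str]:
    capped = []
    for chunk in chunks:
        if len(enc.encode(chunk)) <= max_tokens:
            capped.append(chunk)                       # semantic boundary preserved
        else:
            capped.extend(fallback.split_text(chunk))  # oversized: re-split by size
    return capped
```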

Vectara defaults to sentence-based chunking, not semantic chunking, despite publishing research on both methods. They offer max_chars_chunking_strategy for fixed-size chunks with configurable limits. After extensive testing, their production recommendation is 3-7 sentences per chunk (512-1024 characters) to balance retrieval latency and context preservation. That tells you something about what actually works at scale.

Recent research on "late chunking" offers another perspective. Instead of chunking first and then embedding each chunk in isolation, you encode the entire document with full context into token embeddings, then pool the token embeddings within each chunk's span (typically mean pooling) to produce chunk embeddings. This preserves context from the whole document in each chunk's embedding. The approach shows promise but adds yet more computational overhead.
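
In sketch form, assuming a long-context embedding model from Hugging Face and chunk spans already expressed as token offsets:

```python
# Late chunking sketch: embed the whole document once, then mean-pool the
# token embeddings inside each chunk's span. The model name is one possible
# long-context choice; the spans come from whatever chunker you already use.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "jinaai/jina-embeddings-v2-base-en"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

def late_chunk_embeddings(text: str, spans: list[tuple[int, int]]) -> torch.Tensor:
    """spans: (start, end) token offsets for each chunk."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        token_embs = model(**inputs).last_hidden_state[0]   # (num_tokens, dim)
    # Each chunk embedding is averaged over its tokens, which were encoded
    # with the full document as context.
    return torch.stack([token_embs[start:end].mean(dim=0) for start, end in spans])
```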

Dynamic granularity approaches are emerging: systems that adjust chunk size per knowledge source or query type, driven by query embeddings or retrieval feedback (hit rate, MRR) rather than static rules. Mix-of-Granularity methods show potential but introduce another layer of complexity on top of semantic chunking. The frontier keeps moving toward more sophisticated preprocessing.
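
A toy illustration of the query-side idea, assuming two pre-built indexes at different chunk sizes and a deliberately crude factoid heuristic:

```python
# Toy dynamic-granularity router: short factoid-looking queries hit the
# small-chunk index, everything else hits the large-chunk index.
# The two indexes and the word-count heuristic are illustrative assumptions.
def pick_index(query: str, small_chunk_index, large_chunk_index):
    factoid = len(query.split()) <= 8 and query.lower().startswith(
        ("what", "who", "when", "where", "which")
    )
    return small_chunk_index if factoid else large_chunk_index
```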

The pattern repeats across the research. Advanced chunking methods show theoretical advantages in contrived scenarios, then fail to deliver consistent gains in production. The models compensate. The retrievers adapt. The marginal improvements don't justify the marginal costs.

Bottom line: Semantic chunking solves real problems for complex, multi-topic documents. It introduces overhead that may not be justified for simpler content. The 2024-2025 benchmarks make this clear—it's a context-dependent optimization, not a universal requirement.

Start simple. Measure everything. Then—and only then—get fancy.

Related terms

RAG - Retrieval-Augmented Generation systems that find relevant chunks and feed them to language models

Embeddings - Mathematical representations of text meaning used to measure semantic similarity

Vector database - Storage systems optimized for embedding similarity search

Fixed-size chunking - Simple splitting at predetermined character or token counts

Context window - Maximum amount of text a language model can process at once

Retrieval - The process of finding relevant chunks from your document store

Token - Basic units of text that language models process (roughly 3-4 characters)

Cosine similarity - Mathematical measure of how similar two embedding vectors are

Real-world example

Consider this text about database design:

Database normalization reduces redundancy by organizing data into related tables. Each table should have a primary key that uniquely identifies its rows. Foreign keys create relationships between tables by referencing primary keys in other tables. Connection pooling manages database connections efficiently by reusing existing connections instead of creating new ones for each request.

Fixed-size chunking (60 characters) creates:

  • Chunk 1: "Database normalization reduces redundancy by organizing da"
  • Chunk 2: "ta into related tables. Each table should have a primary"

Notice how "data" gets split, and the concept of primary keys starts mid-chunk. Your RAG system trying to answer "What is database normalization?" retrieves fragments that don't contain complete information.

Semantic chunking recognizes that sentences 1-3 discuss database structure concepts (normalization, keys, relationships) while sentence 4 shifts to performance optimization (connection pooling). It groups the first three sentences together and separates the connection pooling concept.

The first chunk preserves the complete explanation of normalization and table relationships. The second chunk contains the complete connection pooling concept. When your AI searches for information about either topic, it retrieves coherent, complete context instead of partial fragments.

The difference matters when your retrieval quality impacts your product quality. For a technical documentation chatbot, fragmenting key concepts creates confusion. For a high-volume news aggregator where queries are simple and documents are uniform, the sophisticated approach costs more than it delivers.

Measure which scenario you're actually in.