RAG Chunking in 2026: Late Chunking, Contextual Retrieval, and Sane Defaults

When RAG fails, it is almost always retrieval, not generation. Here are the chunking strategies that actually move the needle in 2026.

Sam Carter · Jun 21, 2026 · 9 min read

Cover image for RAG Chunking in 2026: Late Chunking, Contextual Retrieval, and Sane Defaults — Photo: djwtwo / flickr (BY-NC-SA 2.0)

Most teams pour energy into prompt engineering and model selection while their retrieval-augmented generation (RAG) pipeline quietly fails on the unglamorous part: chunking. Industry analysis through 2026 keeps landing on the same uncomfortable conclusion: when RAG produces a wrong or empty answer, the failure is upstream in retrieval far more often than it is in the language model. The frontier model on top cannot answer from a passage your retriever never surfaced. Fix chunking and you fix most of the system.

Quick answer

Start with recursive token splitting at 512 to 1024 tokens with about 10 to 15 percent overlap, then run hybrid retrieval (dense vectors plus BM25) and add a reranker. That baseline beats most fancier approaches. Only after measuring should you layer on Contextual Retrieval (Anthropic; cuts failed retrievals by up to 67 percent when combined with reranking) or late chunking (Jina; strongest on long documents) to recover context lost at chunk boundaries. Treat semantic chunking as a last resort: it is roughly 10 times slower and rarely worth it. Never ship a chunking change without an evaluation set.

Key takeaways

Recursive token splitting at 512 to 1024 tokens with modest overlap is the strongest default; it tops several 2026 end-to-end accuracy benchmarks while staying cheap and fast.
Contextual Retrieval (Anthropic) and late chunking (Jina) both attack the same problem (context lost at chunk boundaries) and can be layered on top of a recursive baseline.
Hybrid retrieval (dense vectors plus BM25 keyword search) followed by a reranker delivers bigger wins than obsessing over chunk size.
Semantic chunking is roughly an order of magnitude slower than token splitting; pay for it only when evals prove it helps.
Never ship a chunking change without an evaluation set; "it feels better" is not a metric.

Why naive chunking breaks

The default approach in most tutorials is to split a document every N characters and embed each piece. This destroys context at boundaries. A clause like "the figure above" or "this approach" loses its referent when the sentence it points to lands in a neighboring chunk. The embedding then represents a fragment that means nothing on its own, and retrieval misses the very passage that answers the question.
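To make the boundary problem concrete, here is a minimal sketch of a fixed-width character splitter. The function name `split_fixed` and the two-sentence sample text are illustrative assumptions, not taken from any particular tutorial.

```python
def split_fixed(text: str, width: int) -> list[str]:
    """Naive splitter: cut every `width` characters, ignoring meaning."""
    return [text[i:i + width] for i in range(0, len(text), width)]


# Illustrative two-sentence document (made up for this example).
first = "Recursive splitting respects paragraph breaks. "
second = "This approach keeps related sentences together."
doc = first + second

# Cut at the length of the first sentence so the referent and the
# back-reference land in different chunks.
chunks = split_fixed(doc, width=len(first))
for chunk in chunks:
    print(repr(chunk))
```

The second chunk opens with "This approach", but the thing it refers to now lives only in the first chunk, so its embedding has nothing to anchor that phrase to.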

Anaphora (pronouns and back-references) is where this hurts most. Research through 2026 reports retrieval accuracy gains of 10 to 12% on documents heavy with pronouns and cross-references when you preserve surrounding context instead of slicing it away. Two techniques target this boundary problem directly, and both have measurable wins.

Contextual Retrieval

Anthropic's Contextual Retrieval prepends a short, model-generated description of where each chunk sits in the larger document before embedding it. Each chunk effectively carries its own context label, so a fragment about "Q3 revenue" becomes "This chunk is from ACME's 2026 annual report; it describes Q3 revenue growth." The same enriched text feeds both the embedding index and a BM25 keyword index.

The published numbers are concrete: contextual embeddings plus contextual BM25 cut failed retrievals by about 49%, and adding a reranking step pushes the reduction to roughly 67%. The cost is an extra LLM call per chunk at indexing time, which prompt caching makes affordable for static corpora. If you are weighing the per-chunk LLM cost, the broader economics in the tokenmaxxing shift are worth a look before you index a large archive.
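The enrichment step described above can be sketched in a few lines. This is a rough illustration, not Anthropic's implementation: `contextualize_chunks` and `stub_describe` are hypothetical names, and the stub stands in for the per-chunk LLM call so the example runs without any API.

```python
from typing import Callable


def contextualize_chunks(
    document: str,
    chunks: list[str],
    describe: Callable[[str, str], str],
) -> list[str]:
    """Prepend a short situating description to each chunk before indexing."""
    return [f"{describe(document, chunk)}\n\n{chunk}" for chunk in chunks]


def stub_describe(document: str, chunk: str) -> str:
    # Placeholder for the per-chunk LLM call (assumption): a real pipeline
    # would prompt a model with the full document plus the chunk.
    # Here the document's first line is simply treated as its title.
    title = document.splitlines()[0]
    return f"This chunk is from {title}."


document = "ACME 2026 annual report\nQ3 revenue grew.\nQ4 guidance was raised."
chunks = ["Q3 revenue grew.", "Q4 guidance was raised."]

# The same enriched strings would feed both the embedding index and BM25.
enriched = contextualize_chunks(document, chunks, stub_describe)
print(enriched[0])
```

Passing the describer in as a function keeps the indexing code independent of whichever model or caching setup generates the descriptions.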

Late chunking

Jina's late chunking inverts the usual order. Instead of splitting first and embedding each piece in isolation, it runs the entire document through a long-context embedding model, then pools token embeddings into chunks afterward. Because every token "saw" the full document during encoding, the resulting chunk vectors preserve cross-boundary meaning. Reported retrieval gains grow with document length, making it especially strong for long, reference-heavy material like contracts and technical manuals.
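The "embed first, pool afterward" order can be sketched as follows. This assumes mean pooling over token spans, and it uses a tiny hand-written matrix as a stand-in for the token embeddings a long-context model would return; `late_chunk`, `mean_pool`, and the span boundaries are hypothetical names and values for illustration.

```python
def mean_pool(vectors: list[list[float]]) -> list[float]:
    """Average a list of equal-length vectors, dimension by dimension."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def late_chunk(
    token_embeddings: list[list[float]],
    spans: list[tuple[int, int]],
) -> list[list[float]]:
    """Pool already-contextualized token embeddings into one vector per chunk.

    `spans` holds (start, end) token offsets, end exclusive.
    """
    return [mean_pool(token_embeddings[start:end]) for start, end in spans]


# Stand-in for the output of one full-document encoding pass (assumption):
# four tokens, two dimensions each.
token_embeddings = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 0.0]]
spans = [(0, 2), (2, 4)]  # two chunks of two tokens each

chunk_vectors = late_chunk(token_embeddings, spans)
print(chunk_vectors)  # [[2.0, 1.0], [1.0, 2.0]]
```

The chunk boundaries only decide which token vectors get averaged together; the encoding pass itself never sees them, which is why cross-boundary context survives.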

Tip

You do not have to choose one. Contextual Retrieval and reranking can sit on top of late-chunked embeddings. Layer them only when your retrieval metrics show boundary loss is the actual bottleneck.

A stack of paper documents on a desk being split into smaller annotated sections — Photo: dimnikolov / flickr (BY 2.0)

Picking a chunk size

Chunk size is the parameter everyone tweaks blindly. The 2026 evidence points to a defensible default range rather than a magic number. Multiple vendor benchmarks now rank recursive splitting around 512 tokens at or near the top for end-to-end accuracy, while earlier studies found faithfulness peaking closer to 1024 tokens for long-form technical and legal content. A reasonable rule of thumb:

256 tokens: short Q&A, FAQs, product snippets.
512 tokens: the safe general-purpose default for mixed prose.
1024 tokens: long-form legal, technical, or reference documents.

Use recursive splitting that respects natural boundaries (paragraphs, then sentences, then words) rather than a hard character cut:

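from langchain_text_splitters import RecursiveCharacterTextSplitter

# The plain constructor measures chunk_size in characters, not tokens.
# from_tiktoken_encoder counts tokens instead (requires the tiktoken package).
# The encoding name is an assumption; pick the one matching your model.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=800,        # tokens
    chunk_overlap=120,     # 15% overlap preserves context
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(document)  # `document` is your source text (str)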

Overlap of 10 to 15% is enough to keep a sentence that straddles a boundary intact. Going much higher mostly inflates your index size and retrieval cost without improving recall.
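A back-of-the-envelope calculation shows how overlap inflates the index. The sketch below assumes an idealized splitter that advances by a fixed stride of `chunk_size - overlap` tokens; a recursive splitter will vary around these figures, and the 100,000-token corpus is an arbitrary example.

```python
def chunk_count(total_tokens: int, chunk_size: int, overlap: int) -> int:
    """Number of chunks produced by a fixed-stride splitter (idealized model)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    if total_tokens <= chunk_size:
        return 1
    stride = chunk_size - overlap
    # One chunk for the first window, then ceil(remaining / stride) more.
    return 1 + (total_tokens - chunk_size + stride - 1) // stride


total = 100_000  # arbitrary corpus size in tokens
no_overlap = chunk_count(total, 512, 0)        # 196 chunks
modest_overlap = chunk_count(total, 512, 64)   # 224 chunks (12.5% overlap)
heavy_overlap = chunk_count(total, 512, 256)   # 390 chunks (50% overlap)

print(no_overlap, modest_overlap, heavy_overlap)
```

Under this model a 12.5% overlap adds about 14% more chunks to embed and store, while a 50% overlap roughly doubles the index.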

The semantic chunking trade-off

Semantic chunking, splitting where embedding similarity drops between adjacent sentences, sounds ideal and sometimes is. But it carries a real cost: it is roughly an order of magnitude slower than token-based splitting (on the order of 0.33 MB/s versus 4.8 MB/s in published comparisons), because it embeds every sentence just to decide where to cut. For a large corpus that turns minutes of indexing into hours.
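A toy version of the boundary detection makes the cost visible. This is a sketch under loud assumptions: a bag-of-words count vector stands in for a real embedding model, and the `threshold` value and sample sentences are made up for illustration.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy stand-in for an embedding model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences: list[str], threshold: float) -> list[list[str]]:
    """Start a new chunk wherever adjacent-sentence similarity drops below threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]  # one embedding per sentence
    chunks = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append([sentence])
        else:
            chunks[-1].append(sentence)
    return chunks


sentences = [
    "The invoice total includes tax.",
    "The invoice tax rate is eight percent.",
    "Our office dog is named Biscuit.",
    "Biscuit the dog sleeps under a desk.",
]
chunks = semantic_chunks(sentences, threshold=0.2)
print(chunks)
```

Note the `vectors` line: every sentence is embedded before a single cut is chosen, which is exactly the overhead that token-based splitting avoids.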

The 2026 consensus is that the newer structural upgrades (contextual retrieval, late chunking, and hierarchical "small-to-find, large-to-read" chunking) usually deliver bigger gains than semantic boundary detection or overlap tuning. Pay for semantic chunking only when your retrieval metrics improve enough to justify the compute.

Here is how the main strategies compare on the trade-offs that actually decide which to use:

Strategy	Indexing cost	Best for	When to reach for it
Recursive token split	Cheap, fast	General-purpose default	Always, as your baseline
Hybrid + reranker	Moderate (rerank at query)	Exact IDs plus meaning	Almost always, biggest win for the effort
Contextual Retrieval	LLM call per chunk (cacheable)	Reference-heavy static corpora	When boundary loss persists after hybrid
Late chunking	Long-context embed pass	Long contracts and manuals	Long documents with cross-references
Semantic chunking	~10x slower	Cleanly topic-segmented text	Last resort, only if evals justify it

A practical recipe

If you are starting fresh in 2026, this sequence captures the high-value wins without overengineering. Treat each step as a layer you add only if the previous one leaves measurable headroom.

Split recursively at 512 to 1024 tokens with ~15% overlap.
Embed with a long-context model so late chunking stays an option.
Run hybrid retrieval: dense vectors plus BM25 keyword search.
Add a reranker over the top candidates before passing to the model.
Only then reach for Contextual Retrieval or late chunking, with semantic chunking as a last resort, measuring each change.

Where you store those vectors matters once the corpus grows; the pgvector vs Qdrant comparison covers the hybrid-search and filtering trade-offs that show up at scale. And if your RAG system feeds a longer-running assistant, the retrieval layer interacts with how you manage agent memory across turns.

Warning

Never ship a chunking change without an evaluation set. Hold out a few dozen real questions with known-good source passages and measure retrieval recall before and after every change.

Measure it, do not feel it

Retrieval recall@k on a held-out question set is the metric that matters: of the questions where the answer lives in your corpus, how often does the right passage appear in the top k results? Track that number as you change chunk size, overlap, and strategy. For judging the downstream answer quality once retrieval is solid, an LLM-as-a-judge evals harness lets you score faithfulness and relevance at scale instead of eyeballing samples. If you run embeddings or rerankers yourself, the throughput characteristics of different local inference engines will shape how affordable per-chunk contextualization actually is.
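The metric as defined above fits in a few lines. This is a minimal sketch: `recall_at_k`, the question IDs, and the passage IDs are hypothetical, and a question counts as a hit if any of its known-good passages appears in the top k.

```python
def recall_at_k(
    results: dict[str, list[str]],
    gold: dict[str, set[str]],
    k: int,
) -> float:
    """Fraction of questions whose top-k results contain a known-good passage."""
    if not gold:
        raise ValueError("evaluation set is empty")
    hits = sum(
        1
        for question, relevant in gold.items()
        if relevant & set(results.get(question, [])[:k])
    )
    return hits / len(gold)


# Hypothetical held-out set: question -> passage IDs that answer it.
gold = {"q1": {"p3"}, "q2": {"p7"}, "q3": {"p1", "p2"}, "q4": {"p9"}}

# Hypothetical retriever output: question -> ranked passage IDs.
results = {
    "q1": ["p3", "p4", "p5"],
    "q2": ["p1", "p2", "p7"],
    "q3": ["p8", "p2", "p6"],
    "q4": ["p4", "p5", "p6"],
}

print(recall_at_k(results, gold, k=1))  # 0.25
print(recall_at_k(results, gold, k=3))  # 0.75
```

Run it before and after each chunking change on the same held-out questions; a change that does not move this number is not worth its indexing cost.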

Frequently asked questions

What chunk size should I start with?

512 tokens is the safest general-purpose default in 2026, with about 10 to 15% overlap. Drop to 256 for short Q&A content and raise to 1024 for long-form legal or technical documents. Always validate against your own evaluation set rather than trusting a single benchmark.

Late chunking or Contextual Retrieval, which one?

They solve the same boundary-loss problem differently and are not mutually exclusive. Late chunking is cheaper at index time (no per-chunk LLM call) and shines on long documents. Contextual Retrieval has stronger published failure-reduction numbers (up to 67% with reranking) but costs an LLM call per chunk. Start with late chunking; add Contextual Retrieval if evals show remaining boundary loss.

Is semantic chunking worth the cost?

Usually not as a first move. It is roughly an order of magnitude slower than recursive token splitting and often loses to recursive-plus-reranking on end-to-end accuracy. Reach for it only when your retrieval metrics flatten and you have ruled out cheaper wins like hybrid search and reranking.

Do I still need BM25 if I have good embeddings?

Yes. Dense vectors capture meaning but miss exact identifiers such as product SKUs, error codes, function names, and proper nouns. BM25 keyword search catches those, and hybrid retrieval that fuses both consistently beats either alone. Anthropic's contextual BM25 index is a core reason their failure-reduction numbers are as large as they are.
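One common way to fuse the two result lists is reciprocal rank fusion (RRF). The article does not name a fusion method, so treat this as one reasonable option rather than the prescribed one; the document IDs are invented, and `k=60` is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each document scores sum(1 / (k + rank)) across lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical ranked results for one query from each retriever.
dense = ["refund-policy", "returns-faq", "sku-4471"]
bm25 = ["sku-4471", "refund-policy", "shipping"]

fused = reciprocal_rank_fusion([dense, bm25])
print(fused)  # ['refund-policy', 'sku-4471', 'returns-faq', 'shipping']
```

RRF uses only ranks, not raw scores, so it sidesteps the problem that cosine similarities and BM25 scores live on different scales. The fused list is what you would hand to the reranker.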

The takeaway

Chunking is where RAG quality is won or lost. Start with recursive 512 to 1024 token chunks and hybrid retrieval, add reranking, and reach for Contextual Retrieval or late chunking when your evals prove boundary loss is hurting you. Measure every step: the failure is almost always upstream of the model.
