What is the best chunking strategy for RAG?

For most application corpora, hierarchical chunking is the best default: you embed and retrieve small chunks for precision, then expand each hit to its larger parent before generation so the model sees full context. Add semantic chunking when the content is narrative and topic boundaries matter, and layer contextual preambles when chunks reference each other. Keep plain recursive splitting only for already-clean, self-contained articles. The real rule is to match the strategy to the corpus rather than apply one splitter everywhere.

What is the difference between recursive and semantic chunking?

Recursive chunking splits on a fixed list of separators (paragraph breaks, then line breaks, then spaces) until each chunk fits a target size, so it optimizes for size and structure, not meaning. Semantic chunking embeds sentences and starts a new chunk where the similarity between neighboring sentences drops below a threshold, so the boundary lands where the topic actually shifts. Recursive is cheap and deterministic; semantic costs embedding calls at ingestion and needs a tuned threshold, but it stops cutting a single idea in half.

What is hierarchical (parent-child) chunking?

Hierarchical chunking stores two linked layers: small child chunks that you embed and search against, and larger parent chunks that hold the surrounding context. At query time you retrieve on the precise child chunks, then swap each match for its parent before sending text to the model. This gives you the retrieval accuracy of small chunks and the answer quality of large ones. LangChain.js implements it as ParentDocumentRetriever; in LlamaIndex.TS you build the same idea with linked node sizes.

How big should RAG chunks be?

There is no universal number, but a common starting point is roughly 200 to 500 tokens per chunk with a small overlap of 10 to 20 percent so a sentence cut at a boundary still appears whole in one chunk. Smaller chunks retrieve more precisely but carry less context; larger chunks carry more context but dilute the embedding and pull in irrelevant text. Hierarchical chunking sidesteps the trade-off by keeping small chunks for retrieval and large parents for generation, so you tune each layer for its own job.

What is contextual chunking and is it worth the cost?

Contextual chunking prepends a short, LLM-generated description of where each chunk sits in its source document before embedding it, so a chunk that says 'the rate above' becomes searchable by the rate's actual name. Anthropic reported that this style of contextual retrieval cut failed retrievals substantially. The cost is one LLM call per chunk at ingestion, which prompt caching makes affordable for a corpus that changes slowly. It is worth it when chunks rely on cross-references that the chunk text alone does not resolve.

RAG Chunking Strategies: Recursive vs Semantic vs…

The best chunking strategy for most RAG applications is hierarchical: embed and retrieve small chunks so the match is precise, then expand each hit to its larger parent before generation so the model reads full context. Layer semantic chunking on narrative content, add contextual preambles when chunks cross-reference each other, and keep plain recursive splitting only for clean, self-contained articles. Chunking is the cheapest lever on retrieval quality, and the default splitter is usually the wrong one.

The code here targets the TypeScript RAG stack as of June 2026: LlamaIndex.TS node parsers (SentenceSplitter, MarkdownNodeParser) and LangChain.js text splitters (RecursiveCharacterTextSplitter, TokenTextSplitter) with its parent-document retriever. Both libraries name the same ideas slightly differently, so the patterns transfer either way.

TL;DR: which chunking strategy should you use?

Default to hierarchical chunking. Index small child chunks for retrieval precision and store larger parent chunks for generation context, then return the parent of whatever child matched. That single change fixes the most common production complaint, where the answer cites the right document but misses the figure or clause sitting one paragraph away. Add semantic chunking when your corpus is prose that wanders across topics, because letting meaning set the boundary keeps a single idea in one chunk. Layer contextual chunking, a short LLM-written preamble per chunk, when chunks lean on cross-references like "the table above." Keep plain recursive splitting only for content that is already short and self-contained. The mistake is not picking the wrong splitter once; it is applying the same splitter to every document type you own.

A comparison table of RAG chunking strategies

Before the detail, here is the whole decision on one screen. Read it top to bottom as an escalation: each row fixes a failure the row above it leaves on the table, at a higher ingestion cost.

Strategy	What it fixes	Ingestion cost	Best corpus
Recursive	Nothing semantic; only enforces a size limit on structured text	Lowest, pure string ops	Short, self-contained articles and FAQs already split into clean sections
Semantic	Boundaries that cut a single idea in half	Medium, one embedding per sentence group	Long-form prose, transcripts, narrative docs with soft topic shifts
Hierarchical	Answers that retrieve the right spot but miss adjacent context	Low to medium, double storage, simple splits	General app corpora: docs, knowledge bases, mixed content
Contextual	Chunks that reference things outside their own text ("the rate above")	Highest, one LLM call per chunk	Cross-referential docs: contracts, specs, financial reports, manuals

The columns that decide it are the last two. Ingestion cost is a one-time price you pay per document, while retrieval quality is a tax you pay on every query forever. That asymmetry is why chunking is worth real attention: you spend once at ingest to stop paying on every single answer. With the map in place, let's start where almost everyone starts, and where almost everyone gets stuck.

Why recursive chunking breaks in production

Recursive character splitting is the default in every RAG quickstart, and for good reason. It is fast, deterministic, needs no model, and it respects document structure as far as it can. You give it a target size and a list of separators, and it tries to split on paragraph breaks first, then line breaks, then spaces, backing off only when a piece is still too big. Here is the LangChain.js version, the one most first-pass RAG builds ship with.

src/ingest/recursive.ts

import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters'

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
})

export async function chunk(text: string): Promise<string[]> {
  return splitter.splitText(text)
}

The chunkSize is a character budget, not a semantic one, and that is the whole problem in one line. The splitter counts characters and stops at the nearest separator under the limit. It has no idea whether character 1000 lands between two paragraphs or in the middle of a sentence that finishes the only definition in the document. The chunkOverlap of 200 characters is a hedge: it repeats the tail of one chunk at the head of the next so a sentence sheared at a boundary still appears whole somewhere. That hedge helps, but it does not change what the splitter optimizes for, which is size.
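To make the back-off concrete, here is a minimal sketch of the idea in plain TypeScript. It is not LangChain's implementation: recursiveSplit is a hypothetical helper, it ignores overlap entirely, and the separator list is an assumption modelled on the paragraph, line, space order described above.

```typescript
// Hypothetical helper: a simplified recursive splitter with no overlap.
// Tries the coarsest separator first and only falls back to a finer one
// for pieces that still exceed the character budget.
function recursiveSplit(
  text: string,
  chunkSize: number,
  separators: string[] = ['\n\n', '\n', ' ', ''],
): string[] {
  if (text.length <= chunkSize) return text.length > 0 ? [text] : []
  const [sep, ...rest] = separators
  // Last resort: no separator left, so cut on raw character count.
  if (sep === undefined || sep === '') {
    const out: string[] = []
    for (let i = 0; i < text.length; i += chunkSize) out.push(text.slice(i, i + chunkSize))
    return out
  }
  const chunks: string[] = []
  let current = ''
  for (const piece of text.split(sep)) {
    const candidate = current === '' ? piece : current + sep + piece
    if (candidate.length <= chunkSize) {
      current = candidate // still under budget, keep packing
      continue
    }
    if (current !== '') chunks.push(current)
    if (piece.length > chunkSize) {
      chunks.push(...recursiveSplit(piece, chunkSize, rest)) // back off to a finer separator
      current = ''
    } else {
      current = piece
    }
  }
  if (current !== '') chunks.push(current)
  return chunks
}

// With a 15-character budget the second sentence is cut in the middle:
// ['Alpha beta.', 'Gamma delta', 'epsilon zeta.']
const demoChunks = recursiveSplit('Alpha beta.\n\nGamma delta epsilon zeta.', 15)
```

Nothing in the loop looks at meaning. The only question it asks is whether the next piece still fits, which is why the second sentence ends up in two chunks.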

For genuinely clean input, this is fine. A corpus of short blog posts, each already a tight unit on one topic, splits well on size alone because the structure and the meaning happen to line up. The failures start when they stop lining up, and in a real app they stop lining up constantly.

The "see the table above" failure

The single most common complaint about a first RAG build sounds like a success at first: the system retrieves the correct document, the answer is on-topic, and yet it is wrong or incomplete. You ask for a figure and get the sentence that introduces the figure without the figure itself. You ask about a pricing tier and get the paragraph that says "as shown in the table above" while the table sits in a different chunk that never got retrieved.

This is recursive splitting working exactly as designed. A 1000-character boundary fell between the prose and the structured element it refers to, so the two were embedded as separate vectors, and the query matched the prose because that is where the words live. The table, which has few of the query's words, scored lower and never made the top results. The retrieval was not broken. The chunking severed a reference that the document assumed you would read together, and no amount of reranking downstream can rejoin two chunks that were split apart at ingest.

Overlap does not save you here, because a table or a code block is usually far larger than a 200-character overlap window. The reference and its target are too far apart to co-occur in one chunk. You can widen the window, but a 4000-character chunk dilutes the embedding so badly that retrieval precision collapses, which is the opposite problem. This is the trade-off hierarchical chunking exists to dissolve.
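A small sketch shows why overlap cannot bridge that distance. slidingWindows is a hypothetical helper that cuts fixed-size character windows with overlap; the document string, the PRICES_TABLE marker, and the sizes are made up for illustration.

```typescript
// Hypothetical helper: fixed-size character windows where each window
// repeats the last `overlap` characters of the previous one.
function slidingWindows(text: string, size: number, overlap: number): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error('require size > 0 and 0 <= overlap < size')
  }
  const step = size - overlap
  const out: string[] = []
  for (let start = 0; start < text.length; start += step) {
    out.push(text.slice(start, start + size))
    if (start + size >= text.length) break // the last window reached the end
  }
  return out
}

// A "table" followed by 600 characters of filler, then the sentence that refers back to it.
const doc = 'PRICES_TABLE' + '.'.repeat(600) + 'see the table above'
const windows = slidingWindows(doc, 100, 20)

// False: no single window holds both the table and the sentence that points at it.
const coOccur = windows.some(w => w.includes('PRICES_TABLE') && w.includes('table above'))
```

Overlap only repairs cuts that land within a few characters of a boundary. A reference and a target that sit further apart than one window never share a chunk, whatever the overlap.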

Code and clauses split mid-unit

The same blade cuts structured text. Run recursive splitting over a documentation site that mixes prose and code, and watch where the boundaries land. A function gets cut after its signature, so the chunk with the signature has no body and the chunk with the body has no name. A legal clause gets split between its condition and its exception, so a query about the exception retrieves a chunk that reads as an unconditional rule. The meaning inverts.

Generic separators are the cause. The default separator list knows about paragraphs and lines, not about the boundaries of a TypeScript function or a numbered contract clause. LangChain.js does ship a structure-aware option for code through RecursiveCharacterTextSplitter.fromLanguage, which swaps in separators tuned to a language's syntax.

src/ingest/code.ts

import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters'

const codeSplitter = RecursiveCharacterTextSplitter.fromLanguage('js', {
  chunkSize: 1200,
  chunkOverlap: 0,
})

export function chunkSource(source: string) {
  return codeSplitter.createDocuments([source])
}

fromLanguage('js') loads separators that prefer to break between functions and classes rather than inside them, so a unit of code tends to survive as one chunk. LlamaIndex.TS offers the same idea through a dedicated CodeSplitter in @llamaindex/node-parser/code and a MarkdownNodeParser that splits on heading structure. The point is not that recursive splitting is unusable. It is that the plain character splitter is the wrong tool the moment your content has internal units, and the fix starts with picking a splitter that knows what those units are. The deeper fix is to stop forcing one chunk to do both jobs.
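For Markdown, "knowing the units" can be as simple as cutting on headings and refusing to cut inside a code fence. The sketch below is a hand-rolled illustration of that idea, not how MarkdownNodeParser is implemented. splitByHeadings and the Section shape are hypothetical names.

```typescript
interface Section {
  heading: string
  body: string
}

// Built with repeat() so this listing does not contain a literal fence marker.
const FENCE = '`'.repeat(3)

// Hypothetical helper: one section per Markdown heading. Lines inside a fenced
// code block are never treated as headings, even when they start with '#'.
function splitByHeadings(markdown: string): Section[] {
  const sections: Section[] = []
  let current: Section = { heading: '', body: '' }
  let inFence = false
  const flush = () => {
    if (current.heading !== '' || current.body.trim() !== '') {
      sections.push({ heading: current.heading, body: current.body.trim() })
    }
  }
  for (const line of markdown.split('\n')) {
    if (line.trim().startsWith(FENCE)) inFence = !inFence
    if (!inFence && /^#{1,6}\s/.test(line)) {
      flush() // close the previous section before starting a new one
      current = { heading: line.replace(/^#{1,6}\s+/, '').trim(), body: '' }
    } else {
      current.body += line + '\n'
    }
  }
  flush()
  return sections
}
```

Each section can then go through a size-based splitter if it is still too long, but the first cut lands on a boundary the author chose.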

Semantic chunking: letting meaning set the boundary

Recursive splitting asks "where is the next separator before I run out of budget?" Semantic chunking asks a better question: "where does the topic actually change?" Instead of a fixed size, it walks the document sentence by sentence, embeds each one, and compares neighbors. While consecutive sentences stay similar in embedding space, they belong to the same chunk. When the similarity drops past a threshold, that gap is a real topic shift, and the chunk ends there.

The result is chunks of uneven length that each hold one coherent idea. A dense, single-topic section becomes one large chunk because nothing inside it crosses the threshold. A section that hops between three subtopics becomes three chunks split exactly at the seams. Size stops being the thing you control and becomes a side effect of meaning, which is the right way around for prose.

The tuning knob is the threshold, and it is a genuine trade-off rather than a value you can set and forget. Make the threshold too sensitive and every minor sentence-to-sentence shift triggers a split, so you get tiny fragments that have the same context-starvation problem as small recursive chunks. Make it too loose and unrelated topics get merged into one bloated chunk whose embedding is an average of several ideas and matches none of them well. Most teams calibrate the threshold against a handful of real queries, watch which chunks come back, and adjust, rather than trusting a default.
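The mechanism fits in a few lines once the embeddings exist. This sketch assumes you already have one embedding vector per sentence from whatever model you use. semanticChunks and cosine are hypothetical helpers, and the two-dimensional vectors in the example are toy values rather than real embeddings.

```typescript
// Cosine similarity between two vectors of equal length.
function cosine(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  a.forEach((x, i) => {
    const y = b[i] ?? 0
    dot += x * y
    normA += x * x
    normB += y * y
  })
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB)
}

// Hypothetical helper: start a new chunk wherever the similarity between
// neighboring sentences drops below `threshold`.
function semanticChunks(sentences: string[], embeddings: number[][], threshold: number): string[] {
  if (sentences.length !== embeddings.length) {
    throw new Error('one embedding per sentence is required')
  }
  const chunks: string[][] = []
  sentences.forEach((sentence, i) => {
    const prev = embeddings[i - 1]
    const cur = embeddings[i]
    const last = chunks[chunks.length - 1]
    const topicShift =
      last === undefined || prev === undefined || cur === undefined || cosine(prev, cur) < threshold
    if (topicShift || last === undefined) chunks.push([sentence])
    else last.push(sentence)
  })
  return chunks.map(group => group.join(' '))
}

// Toy vectors: the first two sentences point one way, the last two another.
const sentences = ['Pricing starts here.', 'The enterprise tier costs more.', 'Support hours follow.', 'Tickets close in a day.']
const vectors = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]
const byTopic = semanticChunks(sentences, vectors, 0.5) // two chunks, split at the topic change
```

Raising the threshold toward 1 on the same input splits every sentence into its own chunk, and dropping it toward 0 merges all four, which is the trade-off described above in miniature.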

There is a cost worth naming. Semantic chunking embeds every sentence at ingestion, which is many more embedding calls than recursive splitting's zero. For a large corpus that is real money and real time, paid once per document. The libraries lean toward sentence-respecting splitters as the practical middle ground. LlamaIndex.TS makes its default node parser sentence-aware, so chunks break on sentence boundaries rather than mid-clause even before you reach for full embedding-based splitting.

src/ingest/sentence.ts

import { SentenceSplitter } from 'llamaindex'

const splitter = new SentenceSplitter({
  chunkSize: 512,
  chunkOverlap: 50,
})

export function chunk(text: string): string[] {
  return splitter.splitText(text)
}

SentenceSplitter defaults to chunkSize: 1024 and chunkOverlap: 200, and crucially it measures in tokens and prefers to break on sentence boundaries, cutting inside a sentence only when that single sentence exceeds the budget. Dropping chunkSize to 512 here makes chunks more retrieval-precise while the sentence-boundary preference keeps each one readable. This is not full semantic chunking, since the boundary is still size-driven, but it removes the mid-sentence cuts that hurt the most for a fraction of the embedding cost. Reach for true embedding-based semantic splitting when the content is long-form prose whose topics drift within a section and sentence boundaries alone are not enough. For everything else, the bigger win usually comes from a different direction entirely.

Hierarchical chunking: retrieve small, generate large

Every strategy so far fights the same losing battle: one chunk size has to be small enough to retrieve precisely and large enough to answer completely, and no single number is both. Hierarchical chunking refuses the premise. It keeps two linked layers of chunks and uses each for the job it is actually good at.

The small layer, the children, is what you embed and search. Small chunks make sharp embeddings, so retrieval lands on the exact sentence that answers the query. The large layer, the parents, is what you send to the model. When a child chunk matches, you do not pass that thin sliver to the model; you look up its parent, the larger section the child came from, and pass that instead. Retrieval gets the precision of small chunks and generation gets the context of large ones, from the same query. The "see the table above" failure mostly disappears, because the parent that contains the matched prose usually contains the table too.

LangChain.js ships this as the parent-document retriever, which wires the two layers together for you.

src/retrieval/parent-document.ts

import { ParentDocumentRetriever } from 'langchain/retrievers/parent_document'
import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters'
import { MemoryVectorStore } from 'langchain/vectorstores/memory'
import { InMemoryStore } from '@langchain/core/stores'
import { OpenAIEmbeddings } from '@langchain/openai'

const retriever = new ParentDocumentRetriever({
  vectorstore: new MemoryVectorStore(new OpenAIEmbeddings()),
  byteStore: new InMemoryStore<Uint8Array>(),
  parentSplitter: new RecursiveCharacterTextSplitter({ chunkSize: 2000, chunkOverlap: 0 }),
  childSplitter: new RecursiveCharacterTextSplitter({ chunkSize: 400, chunkOverlap: 0 }),
  childK: 20,
  parentK: 5,
})

// `docs` is the Document[] produced by whatever loader your ingestion step uses
await retriever.addDocuments(docs)
const results = await retriever.invoke('what is the enterprise rate?')

Two splitters, two sizes, doing two jobs. The childSplitter cuts 400-character chunks that get embedded into the vectorstore, so the similarity search runs against precise units. The parentSplitter cuts 2000-character chunks that get stashed in the byteStore keyed by ID, never embedded. On invoke, the retriever searches the child vectors, takes the top childK matches, maps each back to its parent, and returns the deduplicated parents, capped at parentK. The model receives 2000-character sections that contain the matched sentence plus everything around it. You tuned retrieval and context independently, which is the whole trick.

LlamaIndex.TS expresses the same pattern by parsing a document into nodes of different sizes and linking children to parents, then expanding to the parent at query time. The vocabulary differs, but the shape is identical: small to find, large to answer. The verified APIs for both live in the LlamaIndex.TS node parsers docs and the LangChain.js RAG tutorial.

This is the strategy I reach for first on a general app corpus, and the reason is the cost profile. Hierarchical chunking needs no model at ingestion, only two cheap string-level splits and double the storage, yet it removes the failure that makes most first RAG builds feel broken. Low cost, high payoff. There is one operational wrinkle to handle before it is production-ready.

Parent-child pointers and deduplication

The pointer from child to parent is the load-bearing part. Each child chunk stores the ID of the parent it was cut from, and the parents live in a key-value store, not the vector index. Get this wiring wrong and you either lose the link (so expansion silently returns nothing) or you embed the parents by accident (so retrieval precision collapses back to large-chunk behavior). In the parent-document retriever, the byteStore holds the parents by ID and only the children reach the vectorstore, which is exactly the separation you want.
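If you wire the two layers by hand, the ingestion side can look roughly like this. It is a minimal sketch, not the retriever's internals. buildHierarchy and sliceEvery are hypothetical helpers, the fixed-size character slicing stands in for whatever splitters you actually use, and the parent-N ID scheme is an assumption. The Child shape matches the one the dedupe helper in this section expects.

```typescript
interface Child {
  text: string
  parentId: string
}

interface Hierarchy {
  parents: Map<string, string> // key-value store: parent ID -> parent text, never embedded
  children: Child[] // the only layer that should reach the vector index
}

// Naive fixed-size slicing, standing in for a real splitter.
function sliceEvery(text: string, size: number): string[] {
  const out: string[] = []
  for (let i = 0; i < text.length; i += size) out.push(text.slice(i, i + size))
  return out
}

// Hypothetical helper: cut parents first, then cut each parent into children
// that carry the ID of the parent they came from.
function buildHierarchy(text: string, parentSize: number, childSize: number): Hierarchy {
  const parents = new Map<string, string>()
  const children: Child[] = []
  sliceEvery(text, parentSize).forEach((parentText, index) => {
    const parentId = `parent-${index}`
    parents.set(parentId, parentText)
    for (const childText of sliceEvery(parentText, childSize)) {
      children.push({ text: childText, parentId })
    }
  })
  return { parents, children }
}
```

Cutting children out of each parent, rather than out of the raw document, guarantees that every child sits entirely inside exactly one parent, so expansion never has to choose between two.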

Deduplication is the wrinkle people miss until results look strange. When a query is a strong match, the top child results often come from the same parent section, several sentences that each scored well. Map each child to its parent naively and you hand the model the same 2000-character parent three or four times. That wastes context budget, and worse, the repetition can bias the model toward over-weighting that one section. You have to dedupe parents by ID after expansion and before generation.

src/retrieval/dedupe.ts

interface Child {
  text: string
  parentId: string
}

export function expandToParents(children: Child[], parents: Map<string, string>): string[] {
  const seen = new Set<string>()
  const out: string[] = []
  for (const child of children) {
    if (seen.has(child.parentId)) continue
    seen.add(child.parentId)
    const parent = parents.get(child.parentId)
    if (parent) out.push(parent)
  }
  return out
}

Walking children in score order and skipping any parent already seen keeps the best-ranked unique sections and drops the duplicates. The order matters: because the loop runs over children sorted by relevance, the first time you encounter a parent is via its highest-scoring child, so the parent enters the result at its best rank. The parent-document retriever does this internally, which is one reason to use the library version over hand-rolling the lookup. If you build the pattern yourself in LlamaIndex.TS or against a raw vector store, this dedupe step is not optional. Skip it and a perfect retrieval turns into a context window full of the same paragraph.

Contextual chunking: an LLM preamble per chunk

Hierarchical chunking solves missing adjacent context, but it cannot help a chunk that references something the parent does not contain either. A clause that says "subject to the limits in Section 4" is meaningless on its own, and if Section 4 is ten pages away, no reasonable parent size reaches it. The chunk is locally complete and globally orphaned. This is where contextual chunking earns its cost.

The idea, which Anthropic published as contextual retrieval, is to prepend a short, document-aware description to each chunk before you embed it. You send the chunk plus the whole document to an LLM and ask it to write a sentence or two situating the chunk: what section it belongs to, what the "rate above" actually is, which entity "the company" refers to. That preamble gets embedded along with the chunk text, so the vector now carries the cross-reference the raw chunk lacked. A query for "enterprise tier annual price" can match a chunk that literally only said "the rate above," because the preamble spells out that the rate is the enterprise annual price.

src/ingest/contextual.ts

import Anthropic from '@anthropic-ai/sdk'

const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY ?? '' })

export async function contextualize(documentText: string, chunk: string): Promise<string> {
  const response = await client.messages.create({
    model: 'claude-haiku-4-5-20251001',
    max_tokens: 100,
    system: 'Write one short sentence situating the chunk within the document. Output only that sentence.',
    messages: [
      {
        role: 'user',
        content: `<document>${documentText}</document>\n<chunk>${chunk}</chunk>`,
      },
    ],
  })
  const block = response.content[0]
  const preamble = block && block.type === 'text' ? block.text : ''
  return `${preamble}\n\n${chunk}`
}

The function returns the chunk with its situating sentence glued on top, and that combined string is what you embed and store. A cheap, fast model like Haiku is the right call, since the task is summarization with the full document in front of it, not deep reasoning. Note the structure of the prompt: the whole document goes in every call so the model can resolve references the chunk cannot.

That structure is also the cost, and it is the highest of any strategy here: one LLM call for every chunk in your corpus. The reason it is affordable in practice is prompt caching. The document text is identical across all the calls for that document, so caching the document portion of the prompt collapses the marginal cost of each additional chunk to roughly the price of the chunk and the short output. Anthropic reported that contextual embeddings cut the top-20 retrieval failure rate by 35 percent, that pairing them with contextual BM25 raised the reduction to 49 percent, and that a reranking step on top pushed it to 67 percent. For a corpus that changes slowly, like a contract set, a product spec, or a financial filing, you pay this ingestion cost once and every query afterward benefits. For a corpus that churns hourly, the recurring re-contextualization cost is harder to justify, and you would reserve it for the highest-value document types. Contextual chunking is a precision tool, not a default, and that is the thread running through all of this.
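Caching only helps if the request is shaped so the shared part comes first and is marked cacheable. The sketch below builds such a payload as a plain object. It assumes the Messages API accepts a cache_control field of type ephemeral on a text content block. The TextBlock and ContextualRequest types are local stand-ins rather than SDK types, and buildContextualRequest is a hypothetical helper. Check the current prompt caching documentation for minimum cacheable lengths and cache lifetime before relying on it.

```typescript
// Local stand-ins for the request shape (assumption: mirrors the Messages API).
interface TextBlock {
  type: 'text'
  text: string
  cache_control?: { type: 'ephemeral' }
}

interface ContextualRequest {
  model: string
  max_tokens: number
  system: string
  messages: { role: 'user'; content: TextBlock[] }[]
}

// Hypothetical helper: the document goes in its own block, marked cacheable,
// and the chunk follows in a second block that changes on every call.
function buildContextualRequest(documentText: string, chunk: string): ContextualRequest {
  return {
    model: 'claude-haiku-4-5-20251001',
    max_tokens: 100,
    system: 'Write one short sentence situating the chunk within the document. Output only that sentence.',
    messages: [
      {
        role: 'user',
        content: [
          // Identical for every chunk of this document, so it can be served from cache.
          { type: 'text', text: `<document>${documentText}</document>`, cache_control: { type: 'ephemeral' } },
          // Varies per call, so it sits after the cached prefix.
          { type: 'text', text: `<chunk>${chunk}</chunk>` },
        ],
      },
    ],
  }
}
```

The ordering is the design choice that matters. Everything up to and including the marked block has to be byte-identical between calls for the cache to hit, so the system prompt and the document stay fixed and only the trailing chunk block changes.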

Matching chunking strategy to corpus

Here is the decision framework, because the answer to "which strategy?" is always "for which content?" Run your corpus through these questions in order and stop at the first match.

Are the documents already short and self-contained, like a FAQ where every entry is a tight unit on one topic? Plain recursive splitting is enough, and anything fancier is wasted ingestion cost. Do not pay for embeddings or LLM calls to solve a problem you do not have.

Is the content long-form prose where topics drift inside a single section, like transcripts, articles, or reports? Reach for semantic or at least sentence-aware splitting, so the boundary lands where the meaning changes rather than where the character count runs out. This is where letting meaning set the boundary pays for its embedding cost.

Is it a general application corpus, a mix of docs and knowledge-base pages and structured content where answers keep missing adjacent context? Default to hierarchical. Small children for retrieval, large parents for generation, and the most common production failure is gone for almost no ingestion cost. This is the case most apps are actually in, which is why hierarchical is the default rather than recursive.

Do chunks reference things outside their own text, like contracts with "subject to Section 4" or specs with "the value above"? Layer contextual preambles on top of whichever base strategy you chose, accepting the per-chunk LLM cost because the cross-references will otherwise sink your retrieval no matter how you slice.

The framework composes rather than forcing a single winner. A real system often runs recursive splitting on its short FAQ, hierarchical on its main docs, and contextual hierarchical on its contracts, all in the same pipeline. Routing by document type is the actual recommendation. The instinct to centralize an awkward integration once and keep the rest of the app clean is the same one behind building a model-agnostic AI layer with fallbacks: isolate the messy, content-specific decision in the ingestion layer so the retrieval and generation code stays uniform. The pipeline asks "what kind of document is this?" and dispatches to the right chunker, and everything downstream sees the same chunk shape regardless of how it was produced.
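The four questions above reduce to a small routing function. This is a sketch of the decision only. The CorpusTraits flags are hypothetical, and how you detect them per document type (a path prefix, a content-type tag, a manual mapping) is left open.

```typescript
type BaseStrategy = 'recursive' | 'semantic' | 'hierarchical'

// Hypothetical traits you would set per document type in your own pipeline.
interface CorpusTraits {
  shortSelfContained: boolean // FAQ-style entries that already stand alone
  driftingProse: boolean // long-form text whose topics shift inside a section
  crossReferences: boolean // "subject to Section 4", "the value above"
}

interface ChunkPlan {
  base: BaseStrategy
  contextual: boolean // whether to layer LLM-written preambles on top of the base
}

// The first matching question picks the base strategy; cross-references add
// the contextual layer on top of whichever base was chosen.
function planChunking(traits: CorpusTraits): ChunkPlan {
  const base: BaseStrategy = traits.shortSelfContained
    ? 'recursive'
    : traits.driftingProse
      ? 'semantic'
      : 'hierarchical'
  return { base, contextual: traits.crossReferences }
}

const faqPlan = planChunking({ shortSelfContained: true, driftingProse: false, crossReferences: false })
const contractPlan = planChunking({ shortSelfContained: false, driftingProse: false, crossReferences: true })
```

Here faqPlan comes out as plain recursive and contractPlan as hierarchical with the contextual layer, matching the mixed pipeline described above.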

Validating chunk quality before you commit

Re-chunking a large corpus is expensive, so you want to know a strategy works before you run it across everything. The mistake is to judge chunking by eyeballing the chunks, which tells you they look reasonable and nothing about whether they retrieve. Chunk quality is a retrieval property, not a text property, so you have to measure it as one.

Build a small evaluation set first: twenty to fifty real questions, each paired with the chunk or chunks that genuinely answer it. This is manual work and there is no shortcut, but it is the only thing that turns "the chunks look fine" into a number. It is the same move as testing AI-generated code instead of trusting it: you do not approve output because it reads well, you approve it because a test held. Then, for each strategy you are weighing, run the questions through retrieval and measure how often the correct chunk appears in the top results.

src/eval/recall.ts

interface EvalCase {
  query: string
  relevantIds: string[]
}

export async function recallAtK(
  cases: EvalCase[],
  retrieve: (query: string, k: number) => Promise<string[]>,
  k: number,
): Promise<number> {
  // Guard against an empty eval set, which would otherwise return NaN
  if (cases.length === 0) return 0
  let hits = 0
  for (const { query, relevantIds } of cases) {
    const retrievedIds = await retrieve(query, k)
    // A case counts as a hit when any one relevant chunk is in the top k
    const found = relevantIds.some(id => retrievedIds.includes(id))
    if (found) hits += 1
  }
  return hits / cases.length
}

recallAtK returns the fraction of questions where at least one genuinely relevant chunk made the top k results. Strictly speaking that is a hit rate, since it does not require every relevant chunk to appear, but it is the metric that actually tracks answer quality. Run it once per chunking strategy against the same eval set and the same k, and the comparison stops being a matter of taste. If hierarchical scores 0.9 and recursive scores 0.6 on your content, you have your answer and a number to show for it, rather than a hunch. Watch recall at a small k, like 3 or 5, because that mirrors how many chunks you actually feed the model. A strategy that only finds the right chunk at k of 50 is not helping a generation step that reads the top 5.

This is also how you set the semantic threshold and the parent and child sizes from earlier, rather than guessing. Sweep a couple of values, run the eval, keep the winner. The numbers do the deciding, and you commit to a full re-chunk knowing what you are buying. That is the difference between chunking as a hunch and chunking as an engineering decision.
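A sweep can be sketched as scoring several candidate configurations against the same eval set and sorting. To keep the example synchronous, it assumes you have already run retrieval for each configuration and saved the ranked chunk IDs per query. The Run shape, hitRateAtK, rankRuns, and the labels are all hypothetical.

```typescript
interface EvalCase {
  query: string
  relevantIds: string[]
}

// One candidate configuration plus its precomputed results:
// query -> chunk IDs in ranked order.
interface Run {
  label: string
  rankings: Record<string, string[]>
}

// Fraction of cases where any relevant chunk appears in the top k.
function hitRateAtK(cases: EvalCase[], rankings: Record<string, string[]>, k: number): number {
  if (cases.length === 0) return 0
  let hits = 0
  for (const { query, relevantIds } of cases) {
    const topK = (rankings[query] ?? []).slice(0, k)
    if (relevantIds.some(id => topK.includes(id))) hits += 1
  }
  return hits / cases.length
}

// Score every run on the same cases and the same k, best first.
function rankRuns(runs: Run[], cases: EvalCase[], k: number): { label: string; score: number }[] {
  return runs
    .map(run => ({ label: run.label, score: hitRateAtK(cases, run.rankings, k) }))
    .sort((a, b) => b.score - a.score)
}

const evalCases: EvalCase[] = [
  { query: 'q1', relevantIds: ['a'] },
  { query: 'q2', relevantIds: ['b'] },
]
const runs: Run[] = [
  { label: 'child-800', rankings: { q1: ['x', 'a'], q2: ['y', 'z'] } },
  { label: 'child-200', rankings: { q1: ['a', 'x'], q2: ['y', 'b'] } },
]
const leaderboard = rankRuns(runs, evalCases, 2) // child-200 first with 1, child-800 second with 0.5
```

Holding the eval set and k fixed across runs is what makes the scores comparable. Change either one and you are measuring a different thing.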

Conclusion

Chunking is the highest-impact, lowest-glamour part of a RAG pipeline. It runs once at ingestion, it needs no GPU, and it decides more about answer quality than the reranker or the prompt you will spend next week tuning. The reason most first builds disappoint is not the model and not the vector store; it is a default splitter optimizing for character count on content whose meaning does not respect character count.

If you change one thing, make it hierarchical retrieval: small chunks to find the answer, large parents to ground it. From there, let the corpus tell you what else it needs. Narrative prose wants semantic boundaries, cross-referential documents want contextual preambles, and a tidy FAQ wants nothing more than recursive splitting. Build the eval set before you commit, because the only way to know a chunking strategy works is to watch it retrieve. The next time an answer cites the right document and misses the figure, you will know the fix is upstream of the model, in how you cut the text.
