How do I evaluate a RAG system?

Evaluate each component separately instead of scoring the final answer as one number. Measure retrieval with recall@k, precision@k, and MRR against a labeled set of question-to-chunk pairs, then measure generation in isolation by feeding the model the correct context and scoring faithfulness and answer relevance. A single end-to-end score tells you the answer was bad but not which stage broke, so component scores are what point you at the fix.

Why does my RAG return irrelevant or wrong answers?

The most common cause is retrieval, not the generation model. If the chunk that holds the answer was never in the top-k results, no prompt change and no stronger model can recover it, because the model never saw the fact. Measure retrieval recall@k first. If the right chunk is missing from the retrieved set, the bug is in chunking, embeddings, or the query, and swapping the answer model is wasted effort.

What is recall@k in retrieval evaluation?

Recall@k is the share of relevant chunks that appear in the top k retrieved results for a query. If a question has two relevant chunks and your retriever returns one of them in the top 5, recall@5 is 0.5. It is the first metric to check when answers are bad because it directly measures whether the evidence reached the model at all.

Is LLM-as-a-judge reliable for RAG evaluation?

It is reliable enough to scale evaluation, but only after you calibrate it against human scores on a sample and correct for known biases. The MT-Bench study by Zheng et al. found strong LLM judges agree with humans over 80 percent of the time, about the same as humans agree with each other, while also documenting position, verbosity, and self-enhancement bias. Treat the judge as a noisy instrument you check against a human-labeled set, not as ground truth.

Can I evaluate RAG in TypeScript instead of Python?

Yes. The autoevals library from Braintrust ships RAG scorers in JavaScript and TypeScript, including ContextRecall, ContextPrecision, and Faithfulness, so a Node or full-stack team does not need the Python data stack to run these checks. Ragas remains a Python library, so if you reach for it you are adding a Python service, whereas autoevals runs in the same language as your app.

RAG Evaluation: Measure the Component, Not the Answer

When a RAG answer is wrong, stop swapping the generation model and measure the pipeline by component. A single end-to-end score tells you the answer failed, never where, so teams spend weeks tuning the wrong stage. Score retrieval and generation separately, and make retrieval recall your first diagnostic, because no prompt work rescues an answer when the relevant chunk was never retrieved. The broken stage is usually upstream of the model you keep replacing.

For JavaScript and TypeScript teams, the in-language evaluation path as of June 2026 is the autoevals package from Braintrust, which ships RAG scorers (ContextRecall, ContextPrecision, Faithfulness) that run in Node. Ragas, the reference implementation most metric definitions are written against, is Python only, so reaching for it means standing up a Python service alongside your app.

TL;DR: diagnose the component, not the answer

A bad RAG answer is a symptom with at least four possible causes, and an end-to-end score cannot tell them apart. Break the pipeline into stages and put a number on each one. Measure retrieval with recall@k, precision@k, and MRR against a set of question-to-chunk labels, so you know whether the right evidence even reached the model. Measure generation in isolation by handing the model the correct context and scoring faithfulness and answer relevance, so you know whether the model misused evidence it actually had. Run retrieval first, because it is the cheapest to measure and the most common point of failure. LLM-as-judge scales the generation scoring once you have calibrated it against human labels and corrected for its biases. JavaScript and TypeScript teams can run all of this with autoevals today, no Python required.

Why an end-to-end score hides the real failure

The instinct when a RAG feature returns a vague or wrong answer is to reach for a stronger generation model. It feels like the model is the part that "writes the answer," so the model must be at fault. That instinct is what burns weeks. The answer is the last link in a chain, and an end-to-end score grades the whole chain with one number, which means it grades the link you can see and hides the three you cannot.

Consider what actually happens between a user's question and the text they read back. The question gets embedded and used to search a vector store. The store returns a set of candidate chunks. Those chunks are sometimes reranked. The top chunks are stuffed into a prompt, and the model writes a completion. A wrong final answer can come from any of those steps, and they fail in completely different ways. A retrieval miss means the fact was never in the prompt. A generation miss means the fact was in the prompt and the model ignored it or contradicted it. The fix for the first is nothing like the fix for the second.

This is the same diagnostic discipline behind a layered strategy for testing AI-generated React code: you do not assert on the final rendered screen and call it a test, you put a contract at each boundary so a failure names its own location. A RAG pipeline deserves the same treatment. One number at the end is a smoke alarm. It tells you something is burning, not which room.

Here is the concrete cost of skipping component scores. A team sees a 40% "answer correctness" rate end to end and spends a sprint A/B testing prompts and upgrading the generation model from a mid-tier to a frontier tier. Correctness moves three points. Then someone finally measures retrieval and finds recall@5 sitting at 55%, meaning the correct chunk was absent from the prompt almost half the time. No prompt rescues a missing fact. The sprint was spent polishing the one stage that was already working.

The four places a RAG answer can break

Before measuring anything, you need a mental map of where failures live, because each location has its own metric and its own fix. A RAG answer passes through four stages, and a bad answer traces to at least one of them.

The first stage is the query. A user's question gets turned into a search query, often by embedding it directly, sometimes after a rewrite step. If the question is ambiguous, or the rewrite mangles it, or the embedding model maps it far from the relevant documents, the search starts from the wrong place. Symptoms here look like retrieval failures, which is why query problems often get misdiagnosed.

The second stage is retrieval. Your query hits the vector store and the store returns the top k chunks by similarity. This is where most production RAG breaks, and it breaks quietly. Chunking that splits a fact across two chunks, an embedding model that does not capture your domain's vocabulary, or a k that is too small can all leave the answer-bearing chunk out of the result set. When that happens, every downstream stage is working with incomplete evidence and cannot recover.

The third stage is reranking, when present. Many pipelines retrieve a wide set (say, the top 50) cheaply, then use a slower cross-encoder or LLM reranker to reorder them and keep the best few. A reranker can rescue a pipeline whose first-pass retrieval is noisy, and it can also sink one by demoting the right chunk below the cutoff. Reranking is measured with the same retrieval metrics, applied to the reranked order rather than the raw similarity order.
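One way to see whether a reranker is rescuing or sinking the pipeline is to compare where the labeled chunks sit before and after it runs. The sketch below is a minimal illustration, not a library API: checkRerank and its field names are hypothetical, and it assumes you can capture both the first-pass candidate IDs and the reranked IDs for the same query.

```typescript
// Hypothetical helper: compares the relevant chunks before and after reranking.
export interface RerankCheck {
  recallBefore: number // share of relevant chunks anywhere in the first-pass candidates
  recallAfter: number // share of relevant chunks that survive the rerank cutoff
  demoted: string[] // relevant chunks the first pass found but the reranker cut
}

export function checkRerank(
  relevantIds: string[],
  firstPassIds: string[],
  rerankedIds: string[],
  keep: number,
): RerankCheck {
  const relevant = Array.from(new Set(relevantIds))
  const firstPass = new Set(firstPassIds)
  const kept = new Set(rerankedIds.slice(0, keep))

  const inFirstPass = relevant.filter((id) => firstPass.has(id))
  const inKept = relevant.filter((id) => kept.has(id))
  const demoted = inFirstPass.filter((id) => !kept.has(id))

  // Avoid dividing by zero when a case has no labeled chunks.
  const denom = relevant.length === 0 ? 1 : relevant.length
  return {
    recallBefore: inFirstPass.length / denom,
    recallAfter: inKept.length / denom,
    demoted,
  }
}
```

A non-empty demoted list means the first pass did its job and the reranker pushed the evidence below the cutoff, which points at the reranker or the cutoff rather than at embeddings or chunking.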

The fourth stage is generation. Here the model receives the chunks and writes the answer. It can hallucinate a fact that is not in the context, contradict the context, or answer a different question than the one asked. These are generation failures, and they are the only ones a better generation model or a better prompt can fix. The whole point of component evaluation is to confirm you are in this stage before you spend effort here.

Notice the asymmetry. Three of the four stages are about getting the right evidence in front of the model, and only the last is about what the model does with it. Most teams instrument only the last one. That is the gap.

Evaluating retrieval with recall@k, precision@k, and MRR

Retrieval is where to start because it is the most common failure and the cheapest to measure. It does not need an LLM judge or a subjective scale. It needs a set of questions, the chunk IDs that actually answer each question, and three classic information-retrieval metrics computed against what your retriever returned. The metrics answer three different questions, and you want all three.

Recall@k answers "did the right chunks make it into the top k?" It is the share of relevant chunks for a query that appear in the top k results. If a question has two relevant chunks and the retriever surfaces one of them in the top 5, recall@5 is 0.5. This is the headline metric, because a fact that is not in the top k is a fact the model never sees, and the Ragas docs define context recall the same way: "how many of the relevant documents (or pieces of information) were successfully retrieved."

Precision@k answers "how much of what I retrieved was actually relevant?" It is the share of the top k results that are relevant. If 2 of the top 5 chunks are relevant, precision@5 is 0.4. Low precision is not as fatal as low recall, because a strong model can ignore some irrelevant chunks, but it costs you context-window budget and can distract the model, and the Ragas context precision metric captures the same idea with rank-awareness.

MRR (mean reciprocal rank) answers "how high up is the first relevant chunk?" For each query you take the reciprocal of the rank of the first relevant result (1 if it is first, 1/2 if second, 1/3 if third, and 0 if no relevant chunk was returned), then average across queries. MRR rewards putting the right chunk near the top, which matters because models tend to weight earlier context more heavily and because a reranker's whole job is to lift the right chunk up the list.

A small worked example makes the difference between the three concrete. Take one query with two relevant chunks, c3 and c7. The retriever returns this top 5 in order: c1, c3, c2, c9, c4. Recall@5 is 1 of 2 relevant found, so 0.5. Precision@5 is 1 relevant out of 5 returned, so 0.2. The first relevant chunk c3 sits at rank 2, so the reciprocal rank is 1/2, giving MRR 0.5 for this single query. Three numbers, three different stories: half the evidence is missing, most of what came back is noise, and the one good chunk that did arrive landed near the top.

Recall@k in TypeScript is a few lines, and writing it yourself is worth doing once so the metric is not a black box.

src/eval/retrieval-metrics.ts

export interface RetrievalCase {
  query: string
  relevantIds: string[]
  retrievedIds: string[]
}

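export function recallAtK(testCase: RetrievalCase, k: number): number {
  const relevant = new Set(testCase.relevantIds)
  // A case with no labeled chunks has nothing to recall; score it as 0.
  if (relevant.size === 0) return 0

  // De-duplicate the top k so a chunk returned twice cannot push recall above 1.
  const topK = Array.from(new Set(testCase.retrievedIds.slice(0, k)))
  const found = topK.filter((id) => relevant.has(id)).length
  return found / relevant.size
}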

export function meanRecallAtK(cases: RetrievalCase[], k: number): number {
  if (cases.length === 0) return 0
  const total = cases.reduce((sum, c) => sum + recallAtK(c, k), 0)
  return total / cases.length
}

The function takes the relevant chunk IDs you labeled and the IDs your retriever returned, intersects the top k with the relevant set, and divides by how many relevant chunks existed. Computing this in your own code rather than only inside an eval library matters because retrieval IDs are yours: you control chunking and storage, so you can produce retrievedIds from your real retriever and relevantIds from your labels without any model in the loop. That makes recall@k deterministic, fast, and free to run on every change to chunking or embeddings, which is exactly the property you want from the first diagnostic you reach for.
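Precision@k and MRR follow the same pattern and are just as short. The sketch below is one possible shape, with function names chosen here for illustration. It assumes precision is divided by the number of results actually returned in the top k (some definitions divide by k itself), and that a query with no relevant chunk in the results contributes a reciprocal rank of 0.

```typescript
// Share of the top k results that are relevant.
// Assumption: divide by the number of results actually returned, not by k.
export function precisionAtK(relevantIds: string[], retrievedIds: string[], k: number): number {
  const relevant = new Set(relevantIds)
  const topK = retrievedIds.slice(0, k)
  if (topK.length === 0) return 0
  const hits = topK.filter((id) => relevant.has(id)).length
  return hits / topK.length
}

// 1 / rank of the first relevant result, or 0 when none was retrieved.
export function reciprocalRank(relevantIds: string[], retrievedIds: string[]): number {
  const relevant = new Set(relevantIds)
  const index = retrievedIds.findIndex((id) => relevant.has(id))
  return index === -1 ? 0 : 1 / (index + 1)
}

// Average of the per-query reciprocal ranks.
export function meanReciprocalRank(
  cases: { relevantIds: string[]; retrievedIds: string[] }[],
): number {
  if (cases.length === 0) return 0
  const total = cases.reduce((sum, c) => sum + reciprocalRank(c.relevantIds, c.retrievedIds), 0)
  return total / cases.length
}
```

Run against the worked example (relevant chunks c3 and c7, retrieved order c1, c3, c2, c9, c4), these return a precision@5 of 0.2 and a reciprocal rank of 0.5, matching the hand calculation.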

Run recall at more than one k. Recall@1 tells you whether the single best result is usually right, which matters for a pipeline that feeds only the top chunk. Recall@10 tells you whether the evidence is somewhere in a wider net, which matters when a reranker runs next. A pipeline with high recall@10 but low recall@1 has a ranking problem, not a retrieval problem, and that distinction changes whether you tune embeddings or add a reranker.
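Sweeping k in one pass makes that comparison routine. The following is a small sketch under the same labeling assumptions as before; recallSweep and LabeledCase are illustrative names, and the recall helper is repeated here only so the snippet stands alone.

```typescript
export interface LabeledCase {
  relevantIds: string[]
  retrievedIds: string[]
}

// Recall for one case at one cutoff; cases with no labels score 0.
function recallAt(c: LabeledCase, k: number): number {
  const relevant = new Set(c.relevantIds)
  if (relevant.size === 0) return 0
  const topK = new Set(c.retrievedIds.slice(0, k))
  let found = 0
  relevant.forEach((id) => {
    if (topK.has(id)) found += 1
  })
  return found / relevant.size
}

// Mean recall at each requested k, keyed by k.
export function recallSweep(cases: LabeledCase[], ks: number[]): Record<number, number> {
  const byK: Record<number, number> = {}
  for (const k of ks) {
    const total = cases.reduce((sum, c) => sum + recallAt(c, k), 0)
    byK[k] = cases.length === 0 ? 0 : total / cases.length
  }
  return byK
}
```

Calling recallSweep(cases, [1, 5, 10]) and reading the three numbers together shows whether the evidence is missing entirely or merely ranked too low.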

Evaluating generation in isolation with faithfulness given gold context

Once retrieval is measured, the generation stage has to be tested on its own, and the trick is to remove retrieval from the experiment entirely. Feed the model the correct context by hand, the gold chunks you already labeled as relevant, and then score the answer. This isolates one question: given the right evidence, does the model produce a grounded, on-target answer? If generation scores well on gold context but the end-to-end pipeline scores badly, you have proven the problem is upstream in retrieval, not in the model.
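A gold-context run can be sketched as a small harness that skips the retriever entirely. Everything named here is an assumption for illustration: generate stands in for your own model call, score stands in for whatever faithfulness or relevance scorer you use, and GoldCase is a hypothetical shape for the labeled data.

```typescript
export interface GoldCase {
  question: string
  goldChunks: string[] // the chunks you labeled as relevant, passed in by hand
}

// Stand-in for your generation call: question plus context in, answer out.
export type Generate = (question: string, context: string[]) => Promise<string>

// Stand-in for a scorer that returns a number between 0 and 1.
export type Score = (args: {
  question: string
  answer: string
  context: string[]
}) => Promise<number>

// Mean score when the model is always handed the correct context.
export async function meanScoreOnGold(
  cases: GoldCase[],
  generate: Generate,
  score: Score,
): Promise<number> {
  if (cases.length === 0) return 0
  let total = 0
  for (const c of cases) {
    // Retrieval is bypassed: the model sees the gold chunks directly.
    const answer = await generate(c.question, c.goldChunks)
    total += await score({ question: c.question, answer, context: c.goldChunks })
  }
  return total / cases.length
}
```

Comparing this number with the same scorer run on retrieved context is what separates the two failure modes: a large gap points upstream, a low score on both points at generation.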

The headline generation metric is faithfulness, sometimes called groundedness. It measures whether the claims in the answer are supported by the provided context. The Ragas faithfulness metric defines it as how factually consistent the response is with the retrieved context, computed by breaking the answer into claims and checking what share of those claims the context supports. A low faithfulness score on gold context is the clearest signal of hallucination you can get, because the evidence was right there and the model still asserted something the evidence did not back.

Faithfulness is not the whole picture, which is why answer relevance sits beside it. An answer can be perfectly faithful and still useless: a response that restates the context verbatim without addressing the question is grounded but unhelpful. Answer relevance scores how well the response actually answers the question that was asked. The pair pulls in two directions on purpose. Faithfulness punishes making things up; answer relevance punishes dodging the question. A good answer needs both, and scoring them separately tells you which way a weak answer is failing.

Laying the metrics side by side completes the picture. Context precision and recall ask whether the context was the right context. Faithfulness asks whether the answer is supported by that context. Answer relevance asks whether the answer addressed the question. Together they let you say something specific, such as "retrieval is fine, the model is grounded, but it keeps answering a broader question than the user asked," which is a prompt fix, not a retrieval fix.

LLM-as-a-judge: scaling evaluation without trusting it blindly

Faithfulness and answer relevance cannot be computed with set intersection the way recall@k can, because judging whether a claim is "supported" or an answer is "relevant" is a language task. The scalable way to do it is to ask another LLM to score the answer, the pattern called LLM-as-a-judge. It is what lets you grade a thousand answers overnight instead of paying humans to read them. It is also a measurement instrument with known, repeatable errors, and using it without correcting for those errors gives you confident numbers that are quietly wrong.

The reassuring part first. A capable judge agrees with humans often enough to be useful. The MT-Bench study by Zheng et al. found that "strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans." That is the empirical basis for using a judge at all: it is roughly as consistent with a human as a second human would be. You can read the full study, Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, for the methodology behind that number.

The catch is that the same paper documents three biases you have to account for, and they are not noise that averages out. They are systematic, so they push your scores in a consistent direction.

Position, verbosity, and self-preference bias

The same MT-Bench study names the three failure modes plainly: it examines "position, verbosity, and self-enhancement biases" in LLM judges. Each one distorts scores in a way you can predict and therefore guard against.

Position bias is the judge preferring whichever answer it sees first (or, depending on the model, last) when comparing two responses. The fix is to run every pairwise comparison twice with the order swapped and only count a win when the judge picks the same answer both times. Anything that flips when you flip the order is a coin toss the judge dressed up as a verdict.
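The swap-and-agree rule is simple to express in code. In this sketch the judge is an injected function with an assumed signature (it reports whether the first or second answer shown won), so the snippet is independent of any particular model or SDK; orderRobustVerdict is a name chosen here, not a library call.

```typescript
export type Verdict = 'A' | 'B' | 'tie'

// Assumed judge signature: reports which of the two answers, as shown, it prefers.
export type PairwiseJudge = (
  question: string,
  first: string,
  second: string,
) => Promise<'first' | 'second' | 'tie'>

// Runs the comparison in both orders and only awards a win when both runs agree.
export async function orderRobustVerdict(
  judge: PairwiseJudge,
  question: string,
  answerA: string,
  answerB: string,
): Promise<Verdict> {
  const forward = await judge(question, answerA, answerB) // A is shown first
  const swapped = await judge(question, answerB, answerA) // B is shown first

  const forwardPick: Verdict = forward === 'first' ? 'A' : forward === 'second' ? 'B' : 'tie'
  const swappedPick: Verdict = swapped === 'first' ? 'B' : swapped === 'second' ? 'A' : 'tie'

  // A verdict that flips with the order is treated as no verdict at all.
  return forwardPick === swappedPick ? forwardPick : 'tie'
}
```

The cost is two judge calls per comparison, which is usually a fair price for removing a systematic error from every pairwise result.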

Verbosity bias is the judge rewarding longer answers regardless of whether the extra length adds correct information. This one is dangerous for RAG specifically, because a model that dumps the entire retrieved context into its answer looks thorough to a naive judge while actually scoring worse on answer relevance. Guard against it by scoring faithfulness and relevance separately rather than asking for one overall quality score, and by including answer length as a control when you analyze results.
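One simple way to use answer length as a control is to check how strongly judge scores track length across a batch. The sketch below computes a Pearson correlation between character count and score; it is an illustrative diagnostic rather than a formal bias test, and lengthScoreCorrelation is a hypothetical name.

```typescript
export interface ScoredAnswer {
  answer: string
  score: number // judge score for this answer
}

// Pearson correlation between answer length (in characters) and judge score.
// Returns 0 when there are fewer than two samples or either series is constant.
export function lengthScoreCorrelation(samples: ScoredAnswer[]): number {
  const n = samples.length
  if (n < 2) return 0

  const meanLength = samples.reduce((sum, s) => sum + s.answer.length, 0) / n
  const meanScore = samples.reduce((sum, s) => sum + s.score, 0) / n

  let covariance = 0
  let lengthVariance = 0
  let scoreVariance = 0
  samples.forEach((s) => {
    const dLength = s.answer.length - meanLength
    const dScore = s.score - meanScore
    covariance += dLength * dScore
    lengthVariance += dLength * dLength
    scoreVariance += dScore * dScore
  })

  if (lengthVariance === 0 || scoreVariance === 0) return 0
  return covariance / Math.sqrt(lengthVariance * scoreVariance)
}
```

A strongly positive value does not prove verbosity bias on its own, since longer answers can be genuinely better, but it is a cue to spot-check long, high-scoring answers by hand.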

Self-preference bias, called self-enhancement in the paper, is a judge rating outputs from its own model family more highly than outputs from other families. If the model that generates your answers is the same model that judges them, your scores are inflated by an amount you cannot see. The practical defense is to use a different model as judge than the one under test, or at minimum to disclose that they match so the number is read with suspicion.

Calibrating the judge against human scores

Knowing the biases exist is not enough; you have to measure how far off your specific judge is on your specific task, and that means a calibration set. Take a sample of answers, label them by hand with the scores you trust, then run the judge over the same sample and compare. The agreement rate between the judge and your human labels is the only honest statement you can make about how much to trust the judge's verdicts on the rest of the data.

Calibration does two things. It gives you a number to report ("the judge agrees with our human labels 84% of the time on faithfulness") so nobody mistakes the judge's output for ground truth. And it surfaces the direction of the error, so if the judge is systematically more lenient than your labelers, you know to read its absolute scores as an upper bound. The judge is fine for tracking relative change across pipeline versions, which is what you mostly want. It is not fine as an unaudited source of truth, and a calibration set is what keeps you on the right side of that line. Re-run the calibration whenever you change the judge model, because a new model version can shift both the agreement rate and the biases.
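Both numbers, the agreement rate and the direction of the error, can come out of one pass over the calibration set. This is a minimal sketch that assumes human and judge scores share a 0 to 1 scale and that "agreement" means landing on the same side of a pass threshold; the 0.5 default and the names are assumptions, not part of any library.

```typescript
export interface CalibrationPair {
  human: number // the hand-assigned score you trust
  judge: number // the LLM judge's score for the same answer
}

export interface CalibrationReport {
  agreement: number // share of pairs where judge and human fall on the same side of the threshold
  meanBias: number // mean of (judge - human); positive means the judge is more lenient
}

export function calibrate(pairs: CalibrationPair[], passThreshold = 0.5): CalibrationReport {
  if (pairs.length === 0) return { agreement: 0, meanBias: 0 }

  let agreed = 0
  let biasTotal = 0
  for (const p of pairs) {
    const humanPass = p.human >= passThreshold
    const judgePass = p.judge >= passThreshold
    if (humanPass === judgePass) agreed += 1
    biasTotal += p.judge - p.human
  }

  return {
    agreement: agreed / pairs.length,
    meanBias: biasTotal / pairs.length,
  }
}
```

Report both values next to any judge-derived score, so a reader can see how far the instrument was from the human labels when it was last checked.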

Generating a test set from your own corpus without drowning in labeling

Every metric above needs labeled data: questions paired with the chunks that answer them, and ideally reference answers. The reason teams skip component evaluation is that hand-labeling hundreds of question-to-chunk pairs sounds like a quarter of work nobody has time for. There is a faster path that gets you a usable first test set in an afternoon, and it runs the corpus through an LLM in reverse.

The generation direction is the shortcut. Instead of writing questions and then hunting for the chunk that answers each one, take a chunk you already have and ask a model to generate a question that the chunk answers. Now the labeling is free: the chunk that produced the question is, by construction, a relevant chunk for that question. Do this across a sample of your corpus and you have question-to-chunk pairs without a human writing a single label.

src/eval/generate-testset.ts

import Anthropic from '@anthropic-ai/sdk'

const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY ?? '' })

export interface Chunk {
  id: string
  text: string
}

export interface GeneratedCase {
  query: string
  relevantIds: string[]
}

export async function generateCase(chunk: Chunk): Promise<GeneratedCase> {
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-6',
    max_tokens: 256,
    system:
      'You write one specific question that the given passage fully answers. ' +
      'The question must be answerable from the passage alone. ' +
      'Return only the question text, no preamble.',
    messages: [{ role: 'user', content: chunk.text }],
  })

  const block = response.content[0]
  const query = block && block.type === 'text' ? block.text.trim() : ''
  return { query, relevantIds: [chunk.id] }
}

The function takes a chunk, asks the model for a question that the chunk answers, and pairs the generated question with that chunk's ID as the known-relevant label. Running it across a sample of chunks gives you a synthetic retrieval test set you can feed straight into the recallAtK helper from earlier. The honesty cost is worth stating: a generated question is cleaner and more on-topic than a real user's messy phrasing, so synthetic recall scores tend to run optimistic. Treat this set as a fast baseline and a regression guard, not as a substitute for a smaller set of real user queries you label by hand. The synthetic set tells you when a change made retrieval worse; the real set tells you how good retrieval actually is.
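Turning single generated cases into a usable set takes a little housekeeping. The sketch below is one possible wrapper, with the question generator injected as a function so the snippet does not depend on a specific SDK call. The minimum chunk length, the merge-on-duplicate rule, and the name buildTestSet are all assumptions made for illustration.

```typescript
export interface Chunk {
  id: string
  text: string
}

export interface GeneratedCase {
  query: string
  relevantIds: string[]
}

// Builds a synthetic test set, skipping tiny chunks and empty questions.
// When two chunks yield the same question, their IDs are merged into one case.
export async function buildTestSet(
  chunks: Chunk[],
  generate: (chunk: Chunk) => Promise<GeneratedCase>,
  minChunkChars = 200, // assumed cutoff: very short chunks rarely support a specific question
): Promise<GeneratedCase[]> {
  const cases: GeneratedCase[] = []
  const byQuery = new Map<string, GeneratedCase>()

  for (const chunk of chunks) {
    if (chunk.text.trim().length < minChunkChars) continue

    const generated = await generate(chunk)
    const key = generated.query.trim().toLowerCase()
    if (key === '') continue // the generator returned no usable question

    const existing = byQuery.get(key)
    if (existing) {
      for (const id of generated.relevantIds) {
        if (existing.relevantIds.indexOf(id) === -1) existing.relevantIds.push(id)
      }
    } else {
      const fresh = { query: generated.query.trim(), relevantIds: generated.relevantIds.slice() }
      byQuery.set(key, fresh)
      cases.push(fresh)
    }
  }
  return cases
}
```

Passing the generateCase function from the listing above as the generate argument would produce cases in the shape recallAtK expects once retrievedIds is filled in from your retriever.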

Two refinements make the set less artificial. Generate a few questions per chunk at different specificities, so the set is not all narrow lookups. And periodically mine your production logs for real questions and label those, even twenty of them, because twenty real queries catch phrasing failures that a thousand synthetic ones never will. As for size, there is no magic number, but a few hundred synthetic cases plus a couple of dozen hand-labeled real ones is enough to make recall@k move meaningfully when you change chunking, which is the whole point of having the set.

Running RAG evals in TypeScript with autoevals

The case for staying in TypeScript is not stubbornness. Most of the metric tooling for RAG was written in Python, and the usual advice is to bolt a Python service onto your stack just to run evals. For a React, Vue, or Nuxt team that ships in TypeScript, that is a second runtime, a second dependency graph, and a context switch every time you touch the evals. The autoevals package from Braintrust removes that tax by shipping the RAG scorers as JavaScript and TypeScript functions you call inline, in the same language as the feature you are testing. Install it with npm install autoevals.

The scorers you want for RAG are ContextRecall, ContextPrecision, and Faithfulness. They are async functions that take an object with input, output, expected, and context fields and return a result whose score is a number between 0 and 1 with supporting detail in metadata. The context field accepts a string or an array of strings, which maps directly onto the chunks your retriever returned.

src/eval/autoevals-rag.ts

import { ContextRecall, ContextPrecision, Faithfulness } from 'autoevals'

export interface RagSample {
  question: string
  answer: string
  reference: string
  retrievedChunks: string[]
}

export async function scoreSample(sample: RagSample) {
  const recall = await ContextRecall({
    input: sample.question,
    output: sample.answer,
    expected: sample.reference,
    context: sample.retrievedChunks,
  })

  const precision = await ContextPrecision({
    input: sample.question,
    output: sample.answer,
    expected: sample.reference,
    context: sample.retrievedChunks,
  })

  const faithfulness = await Faithfulness({
    input: sample.question,
    output: sample.answer,
    context: sample.retrievedChunks,
  })

  return {
    contextRecall: recall.score,
    contextPrecision: precision.score,
    faithfulness: faithfulness.score,
  }
}

Each scorer is doing the LLM-as-a-judge work described earlier, wrapped so you call it like any other async function. ContextRecall and ContextPrecision need the expected reference answer because they judge the retrieved context against what a correct answer requires. Faithfulness does not take expected, because it judges the answer only against the context it was given. Pass the gold chunks as context instead of the retrieved ones and the same call becomes the gold-context isolation test from before. The scorers default to an LLM judge under the configured provider, so the position, verbosity, and self-preference caveats apply here too: calibrate these scores against a human-labeled sample before you trust their absolute values, and use them mainly to track movement between pipeline versions.

These autoevals scorers are LLM-judged and therefore non-deterministic, which is the opposite of the hand-written recallAtK helper. That contrast is the practical division of labor. Compute the deterministic, ID-based retrieval metrics yourself so they run instantly and identically on every commit, and reach for the autoevals judge-based scorers for the language-level judgments (faithfulness, context quality given a reference) that genuinely need a model. Wire both into the continuous-integration gate you already use for tests, so a change that tanks recall fails the build instead of reaching users. A hardened AI-generated React app treats every AI boundary as something to verify before it ships, and the eval suite is one more such boundary. Apply the review reflex behind an AI code review checklist for React and Vue to the eval thresholds as well: decide what score is allowed to merge, and enforce it.
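The gate itself can be a few lines that compare scores to agreed minimums and report what fell short. This is a sketch under assumptions: the metric names and threshold values in the usage note are placeholders, checkThresholds is a hypothetical helper, and a metric that is missing from the scores is treated as a failure so a skipped eval cannot pass silently.

```typescript
// Returns one human-readable failure per metric that is missing or below its minimum.
export function checkThresholds(
  scores: Record<string, number>,
  minimums: Record<string, number>,
): string[] {
  const failures: string[] = []
  for (const metric of Object.keys(minimums)) {
    const minimum = minimums[metric]
    const actual = scores[metric]
    if (minimum === undefined) continue
    if (actual === undefined) {
      failures.push(`${metric}: no score reported (minimum ${minimum})`)
    } else if (actual < minimum) {
      failures.push(`${metric}: ${actual} is below the minimum ${minimum}`)
    }
  }
  return failures
}
```

In a CI script you might call checkThresholds({ recallAt5: 0.82 }, { recallAt5: 0.8 }), print any failures, and exit with a non-zero code when the list is not empty. Because judge-based scores vary between runs, it is safer to leave some margin on those thresholds than on the deterministic ones.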

The recommendation: retrieval recall is your first diagnostic

If you take one rule from all of this, make it the order of operations. When a RAG answer is bad, do not touch the generation model until you have measured retrieval recall@k. The reason is mechanical, not stylistic: retrieval recall is the only metric that tells you whether the answer was even possible. A fact that never entered the top-k context cannot be generated, no matter how good the model or the prompt, so a low recall score caps every downstream metric and renders generation tuning pointless until you fix it.

The diagnostic sequence falls out of that. Measure recall@k first, with your own deterministic helper, against a test set you generated from your corpus. If recall is low, the bug is in chunking, embeddings, the query, or k, and you stay upstream. If recall is healthy, move to generation: feed the model gold context and score faithfulness and answer relevance with autoevals. If faithfulness is low on gold context, now and only now is the generation model or the prompt the suspect. Layer the LLM judge on top once it is calibrated, and use it to watch the trend across versions rather than as an oracle.
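The sequence above reduces to a short decision function. The sketch below is illustrative only: the threshold defaults are placeholders you would set from your own data, and diagnose, StageScores, and the returned labels are hypothetical names.

```typescript
export interface StageScores {
  recallAtK: number // deterministic retrieval recall on the test set
  goldFaithfulness: number // faithfulness when the model is fed gold context
  goldRelevance: number // answer relevance when the model is fed gold context
}

export interface DiagnosisThresholds {
  minRecall: number
  minFaithfulness: number
  minRelevance: number
}

export type Suspect = 'retrieval' | 'generation' | 'none'

// Placeholder defaults; choose real values from your own baseline runs.
const defaultThresholds: DiagnosisThresholds = {
  minRecall: 0.8,
  minFaithfulness: 0.8,
  minRelevance: 0.8,
}

// Checks retrieval first and only blames generation once recall is healthy.
export function diagnose(
  scores: StageScores,
  thresholds: DiagnosisThresholds = defaultThresholds,
): Suspect {
  if (scores.recallAtK < thresholds.minRecall) return 'retrieval'
  if (
    scores.goldFaithfulness < thresholds.minFaithfulness ||
    scores.goldRelevance < thresholds.minRelevance
  ) {
    return 'generation'
  }
  return 'none'
}
```

The ordering of the checks is the point: a low recall score returns before the generation scores are even consulted, which mirrors the rule that generation tuning waits until retrieval is fixed.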

This is deliberately the opposite of the default loop, where a bad answer triggers a model swap and a prompt rewrite. That loop works just often enough to keep you doing it, which is what makes it seductive and expensive: the real fault sits in a retriever you never measured. Component evaluation replaces the guess with a number that names the broken stage. The judge bias work matters because the moment you scale generation scoring with an LLM, you inherit position, verbosity, and self-preference distortions, and a calibration set is the cheap insurance that keeps those scaled numbers honest.

Conclusion

The pipeline you cannot measure by component is a pipeline you tune by superstition. The instinct to reach for a bigger model is the one worth resisting, because the answer is the one stage in a RAG system that sits downstream of everything else that can go wrong. Put a number on retrieval, put a separate number on generation given correct context, and the bad answer stops being a mystery and starts being an address.

What changes once the component scores exist is the conversation. Instead of arguing about which model writes better answers, you point at recall@5 sitting at 55% and the argument is over. The next move after this is to take the same instrumentation past the offline test set and into production, scoring a sample of live traffic so retrieval drift shows up as a falling number on a dashboard rather than as a support ticket. Offline scores tell you the pipeline was sound when you shipped it. Production signals tell you it still is.

Written by Thomas Findlay.