
Evaluating RAG Quality: Beyond "Does It Answer the Question?"


Most teams evaluate RAG by asking: does it answer correctly? That question catches less than half the failures. Here's the full evaluation stack.

Key Takeaways: RAG evaluation has at least five distinct layers — retrieval quality, answer faithfulness, answer relevance, context utilization, and business-level metrics — and most teams measure only the last one, poorly. A model can return correct-sounding text while misusing the retrieved context entirely. Evaluation datasets need to exist before you build, not after. The RAGAS framework is a useful starting point for automated evaluation, but it doesn't replace a clear definition of what "good" means for your specific workflow.

The most common RAG evaluation question is: "Does it answer correctly?" A human reads a few outputs, nods, and declares the system ready. That workflow will miss most production failures.

Human spot-checking catches obvious wrong answers. It misses everything else. A system that retrieves the wrong documents, ignores half of what it retrieved, or answers a slightly different question than the one asked can still score well on human review, as long as the outputs are fluent and plausible. And fluent, plausible, wrong outputs are exactly what well-tuned language models produce.

The Five Layers of RAG Evaluation

A RAG system is two pipelines in sequence: retrieval and generation. Each fails independently. Failure compounds — a retrieval miss that returns plausible-but-wrong context produces a confident, fluent, wrong answer that is very hard to catch by reading the output alone.

Each layer requires its own metric.

Retrieval quality. Did the retrieval pipeline return the relevant documents?

Standard metrics come from information retrieval research: hit rate (did the correct document appear in the top-k results at all?), MRR (Mean Reciprocal Rank — how high did it rank?), and NDCG (Normalized Discounted Cumulative Gain — a more nuanced measure that weights the rank position of each relevant result). These metrics require a labeled dataset — you need to know, for a sample of questions, which documents are the right ones to retrieve.
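For concreteness, here is a minimal sketch of hit rate@k and MRR in plain Python. The field names (relevant_ids, retrieved_ids) are placeholders for however your harness stores the labels and the retriever's ranked output.

```python
# Minimal sketch: hit rate@k and MRR over a labeled retrieval set.
# Field names are illustrative; adapt them to your own harness.

def hit_rate_at_k(examples, k=5):
    """Fraction of queries where at least one relevant doc appears in the top-k."""
    hits = sum(
        1 for ex in examples
        if any(doc_id in ex["relevant_ids"] for doc_id in ex["retrieved_ids"][:k])
    )
    return hits / len(examples)

def mean_reciprocal_rank(examples):
    """Average of 1/rank of the first relevant doc (0 if none was retrieved)."""
    total = 0.0
    for ex in examples:
        for rank, doc_id in enumerate(ex["retrieved_ids"], start=1):
            if doc_id in ex["relevant_ids"]:
                total += 1.0 / rank
                break
    return total / len(examples)

examples = [
    {"query": "when are invoices due?",
     "relevant_ids": {"doc-17"},
     "retrieved_ids": ["doc-03", "doc-17", "doc-42"]},
]
print(hit_rate_at_k(examples), mean_reciprocal_rank(examples))  # 1.0 0.5
```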

Retrieval quality is where most RAG systems fail in production, and where most teams do the least evaluation. If the right document isn't being retrieved, nothing downstream can rescue the answer.

Answer faithfulness. Did the answer actually come from the retrieved context?

This is distinct from whether the answer is correct. A model can generate a faithful answer grounded in the retrieved context even if the context is wrong. It can also generate a correct answer that has nothing to do with the retrieved context — it just knew the answer from training data. Both are problems.

Faithfulness measures whether claims in the answer are traceable to the retrieved documents. Automated approaches use an LLM as a judge: ask it to verify each claim in the answer against the retrieved chunks. It's imperfect, but it catches the failure mode where the model ignores the context and answers from memory. In enterprise RAG, that failure mode matters a lot — the whole point is to ground the model in your specific data, not in what it learned during pre-training.
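As a rough sketch of the idea (not the RAGAS implementation), a claim-level faithfulness check could look like the following, where call_llm stands in for whatever client you use and the naive sentence split stands in for a proper LLM-based claim extractor.

```python
# Rough sketch: claim-level faithfulness with an LLM as judge.
# `call_llm` is a placeholder for your own client; the prompt wording is illustrative.

def split_into_claims(answer: str) -> list[str]:
    # Naive stand-in for an LLM-based claim extractor.
    return [s.strip() for s in answer.split(".") if s.strip()]

def faithfulness_score(answer: str, chunks: list[str], call_llm) -> float:
    """Fraction of claims the judge considers supported by the retrieved chunks."""
    claims = split_into_claims(answer)
    if not claims:
        return 0.0
    context = "\n\n".join(chunks)
    supported = 0
    for claim in claims:
        verdict = call_llm(
            f"Context:\n{context}\n\n"
            f"Claim: {claim}\n"
            "Reply 'yes' if the claim is fully supported by the context, otherwise 'no'."
        )
        if verdict.strip().lower().startswith("yes"):
            supported += 1
    return supported / len(claims)
```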

Answer relevance. Did the model answer the question that was asked?

This sounds trivial. It isn't. RAG systems often respond to a related but different question — especially when the retrieved documents are topically relevant but don't directly address the query. The model, having retrieved something about invoice approval workflows, will answer a question about invoice approval workflows even if the actual question concerned a specific edge case that none of the retrieved chunks addressed.

Relevance can be measured by asking an LLM to independently re-generate the most likely question given only the answer, then measuring the similarity between the re-generated question and the original. High similarity means the answer addressed the question asked. Low similarity is a sign the model answered the question it wished it had been asked.
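A sketch of that approach, assuming hypothetical call_llm and embed helpers for your generation and embedding clients:

```python
# Sketch: answer relevance via question re-generation.
# `call_llm` and `embed` are placeholders for your own generation and embedding calls.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def answer_relevance(question: str, answer: str, call_llm, embed) -> float:
    regenerated = call_llm(
        "Given only the answer below, write the question it most likely responds to.\n\n"
        f"Answer: {answer}"
    )
    # High similarity: the answer addresses the question asked; low: it drifted.
    return cosine(embed(question), embed(regenerated))
```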

Context utilization. Did the model use what was retrieved?

A retrieval system can return five relevant chunks. The model might synthesize all five. Or it might anchor on the first chunk and ignore the rest. Context utilization measures how much of the retrieved information the model actually incorporated into its answer.

This matters because selective use of context causes answers that are technically faithful to something retrieved, but miss important qualifications or contradictions in the other chunks. In a contract review context, a model that reads five clauses and synthesizes one is dangerous — especially if the one it ignores is an exception.
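There is no single standard metric here. One cheap first-pass signal is lexical overlap between each retrieved chunk and the answer: crude, but enough to flag answers that ignored most of what was retrieved. A production version would use an LLM judge or citation attribution instead.

```python
# Sketch: a crude context-utilization signal based on lexical overlap.
# The 0.2 threshold is arbitrary; tune it (or replace the heuristic) for your domain.

def token_overlap(chunk: str, answer: str) -> float:
    chunk_tokens = set(chunk.lower().split())
    answer_tokens = set(answer.lower().split())
    if not chunk_tokens:
        return 0.0
    return len(chunk_tokens & answer_tokens) / len(chunk_tokens)

def context_utilization(chunks: list[str], answer: str, threshold: float = 0.2) -> float:
    """Fraction of retrieved chunks whose content visibly shows up in the answer."""
    if not chunks:
        return 0.0
    used = sum(1 for chunk in chunks if token_overlap(chunk, answer) >= threshold)
    return used / len(chunks)
```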

Business-level metrics. Did the user find what they needed?

This is where most evaluation efforts start and stop. Task completion rate, user satisfaction, escalation rate, and downstream impact (did the user override the answer? did they repeat the same query?) are all important. But business-level metrics sit at the top of the stack, not the bottom.

A 90% task completion rate can coexist with a retrieval system that only works because users rephrase their questions until they get results. Business-level metrics tell you whether the system is useful. They don't tell you why it's failing when it does.

Why the Evaluation Dataset Must Come First

The correct order is: define evaluation criteria, build the dataset, then build the system.

Most teams invert this. They build a prototype, generate some test questions, run them manually, adjust until it looks good, and ship. The test questions generated after building the system are biased toward the happy path — questions the developer already knows the system handles well.

A proper evaluation dataset starts from the use case. You sample actual queries users will run — from support tickets, user interviews, or past manual research requests. You annotate correct answers and correct supporting documents. You include hard cases: ambiguous questions, multi-hop queries that require synthesizing across documents, questions the system should answer “I don’t know” to.
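Concretely, each record in such a set might look like the sketch below (field names are illustrative); the second example is a hard case where the correct behavior is to say "I don't know."

```python
# Sketch: two records from a labeled evaluation set. Field names are illustrative.
eval_set = [
    {
        "question": "When are supplier invoices due?",
        "relevant_doc_ids": ["finance-policy-04"],
        "ground_truth_answer": "30 days after receipt, unless the contract says otherwise.",
        "expected_behavior": "answer",
    },
    {
        # Hard case: nothing in the corpus covers this, so the system should abstain.
        "question": "What is the approval flow for invoices disputed by a former subsidiary?",
        "relevant_doc_ids": [],
        "ground_truth_answer": None,
        "expected_behavior": "abstain",
    },
]
```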

Hard cases are almost never included in post-hoc test sets. That’s the gap that produces systems with impressive demo accuracy and poor production reliability.

Building the evaluation dataset before the system also forces a concrete definition of success. “It should answer product questions accurately” is not a success criterion. “Hit rate@5 ≥ 0.85, faithfulness ≥ 0.90, and user escalation rate ≤ 10% on a 200-question labeled set” is a success criterion. The act of building the dataset forces you to specify what “accurate” means, for which questions, measured how.

This is a pattern that recurs in PoC work more broadly — the methodology that produces reliable PoC results always defines the measurement before running the experiment.

The RAGAS Framework: Useful, Not Sufficient

RAGAS (Retrieval Augmented Generation Assessment) is an open-source library that automates RAG pipeline evaluation. It implements metrics for context recall, faithfulness, answer relevancy, and context precision using LLMs as judges. It's a reasonable starting point and reduces the friction of building an evaluation harness from scratch.
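A typical run looks roughly like the sketch below. Column names and metric imports have shifted across RAGAS versions, so treat this as an outline and check the current documentation; it also requires an LLM (and usually an embedding model) configured as the judge.

```python
# Sketch of a RAGAS evaluation run (API details vary by version; check the docs).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

data = {
    "question": ["When are supplier invoices due?"],
    "answer": ["Invoices are due 30 days after receipt."],
    "contexts": [["Payment terms: supplier invoices are due 30 days after receipt."]],
    "ground_truth": ["Invoices are due 30 days after receipt."],
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores on the evaluation set
```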

The caveats matter.

First, RAGAS metrics are automated approximations, not ground truth. The LLM judge can be fooled by fluent-but-wrong answers, especially on domain-specific content the judge model hasn't encountered. The faithfulness metric struggles with nuanced factual claims where the difference between what the context says and what the answer says is subtle.

Second, RAGAS doesn’t evaluate your retrieval layer directly. It evaluates what came out of the generator, not what came out of the retrieval step independently. Hit rate, MRR, and NDCG require separate tooling and a separate labeled dataset.

Third, RAGAS scores aren’t comparable across systems unless you hold the test set constant. A faithfulness score of 0.82 means nothing in isolation. It means something relative to a baseline on the same question set.

Use RAGAS to automate routine evaluation and catch regressions when the pipeline changes. Don’t treat its output as a certificate of quality.

The Retrieval Layer Is Usually the Bottleneck

When RAG systems underperform, the intuitive explanation is that the LLM is the problem — it's hallucinating, ignoring context, needs a better prompt. That's often wrong.

A cosine similarity score over 768-dimensional embeddings feels like a measure of semantic relevance. In practice it conflates topical similarity with query-specific relevance. A chunk about invoice payment terms and a chunk about invoice due date exceptions might both score high for the query "when are invoices due?" — but only one answers the question.

Retrieval quality degrades in predictable ways: sparse document coverage (the answer isn't in any chunk), poor chunking decisions (the answer spans two chunks that got split at a boundary), and query-document vocabulary mismatch (the query uses different terminology than the documents). Each has a specific fix. None of them are fixed by changing the LLM.

Before changing any other part of a RAG system, measure the retrieval layer independently. Pull the top-5 retrieved chunks for a sample of queries and read them. Ask: if you had only these five chunks and no other knowledge, could you answer the question correctly? If the answer is often no, you have a retrieval problem.
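That audit takes a few lines to script. The sketch below assumes a retriever object with a search method that returns the top-k chunk texts; swap in your own retrieval call.

```python
# Sketch: dump the top-k chunks per query for manual review.
# `retriever.search(query, k)` is a placeholder for your own retrieval call.

def audit_retrieval(retriever, queries, k=5):
    for query in queries:
        chunks = retriever.search(query, k=k)
        print(f"\n=== {query} ===")
        for rank, chunk in enumerate(chunks, start=1):
            print(f"[{rank}] {chunk[:200]}")
        # Could you answer the question correctly with only these chunks?
        # If the answer is often no, fix retrieval before touching prompts or models.
```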

Building Evaluation In, Not On

The evaluation framework built for a PoC should carry through to production. This doesn’t mean running a full evaluation suite after every change — it means instrumenting the system to capture the data needed for evaluation over time.

Log retrieved chunks alongside queries and answers. Log user feedback signals — escalations, overrides, repeat queries on the same topic. Sample a fraction of interactions for human review on a schedule. Build a regression test that runs a fixed evaluation set on every pipeline change and blocks deployment if key metrics drop below threshold.
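The deployment gate itself can be a short script in CI. The thresholds below echo the example success criterion from earlier and are illustrative; the metrics dict would come from running the fixed evaluation set against the changed pipeline.

```python
# Sketch: a CI gate that blocks deployment when evaluation metrics regress.
# Thresholds are illustrative; in CI the metrics dict comes from your evaluation run.
import sys

THRESHOLDS = {"hit_rate_at_5": 0.85, "faithfulness": 0.90}

def gate(metrics: dict) -> None:
    failures = {
        name: value for name, value in metrics.items()
        if name in THRESHOLDS and value < THRESHOLDS[name]
    }
    if failures:
        print(f"Evaluation regression, blocking deploy: {failures}")
        sys.exit(1)
    print("All tracked metrics above thresholds.")

if __name__ == "__main__":
    gate({"hit_rate_at_5": 0.88, "faithfulness": 0.93})
```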

An evaluation framework built after the fact, on a live system, is fighting a rearguard action. The bugs are already in production, users have formed opinions, and you’re measuring a moving target. Build it before the first user query, even if it starts small.

The evaluation dataset is a deliverable, not an afterthought. If it isn’t scoped into the project from the start, it won’t exist when you need it.


At Trobz, when we scope a RAG project, the evaluation dataset is part of the delivery. If you're defining what "good enough" looks like for your use case, we're happy to share the evaluation setup we use.
