Retrieval-Augmented Generation (RAG) has emerged as a dominant framework for feeding Large Language Models (LLMs) context beyond the scope of their training data, enabling them to respond with more grounded answers and fewer hallucinations.
However, designing an effective RAG pipeline can be challenging. You need to answer questions such as:
- How should you parse and chunk text documents for vector embedding? What chunk size and overlap size should you use?
- What vector embedding model should you use?
- What retrieval method should you use to fetch the relevant context? How many documents should you retrieve by default? Does the retriever actually manage to retrieve the relevant documents?
- Does the generator actually generate content that is in line with the retrieved context? What parameters (model, prompt template, temperature) work best?
The only way to objectively answer these questions is to measure how well the RAG pipeline works, but what exactly do you measure, and how do you measure it? This is the topic I’ll cover here.
Typical RAG pipeline
A typical RAG pipeline is made up of three separate phases: ingestion, retrieval, and generation.
In the ingestion phase, you split documents into smaller text chunks (with a certain chunk size and overlap size), use an embedding model to convert those text chunks into vectors (numerical representations of the text), and save them into a vector database. This is typically a one-time operation (unless you need to change one of the parameters, such as the embedding model or chunk size).
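Here's a minimal sketch of the ingestion phase. It assumes Chroma as the vector database (with its default embedding model) and a hypothetical handbook.txt as the source document; your own stack and parameters will likely differ:

```python
import chromadb

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks that overlap by `overlap` characters."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# Hypothetical source document; replace with your own parsing logic.
with open("handbook.txt") as f:
    chunks = chunk_text(f.read())

# Chroma embeds each chunk with its default embedding model and stores the vectors.
client = chromadb.PersistentClient(path="./rag-db")
collection = client.get_or_create_collection("docs")
collection.add(ids=[f"chunk-{i}" for i in range(len(chunks))], documents=chunks)
```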
In the retrieval phase, the retriever part of the RAG pipeline takes the user’s question, uses the same embedding model to convert it into a vector, and performs a similarity search against the text chunks in the vector database. This retrieves the top K similar text chunks, which are then aggregated into a context for the generation phase.
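Continuing the sketch above, the retrieval phase boils down to a top-K similarity query against that collection (Chroma embeds the question with the same embedding model it used during ingestion):

```python
# Hypothetical user question.
question = "How many vacation days do employees get?"

# Top-K similarity search against the collection built during ingestion;
# K (n_results) is one of the retrieval parameters worth evaluating.
results = collection.query(query_texts=[question], n_results=5)
retrieval_context = results["documents"][0]  # the K most similar chunks

# Aggregate the chunks into a single context for the generation phase.
context = "\n\n".join(retrieval_context)
```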
In the generation phase, the generator takes the context from the retrieval phase and generates a response to the user using the supplied parameters, such as the LLM, prompt template, and temperature.
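And here's a sketch of the generation phase, assuming the OpenAI API and a hypothetical prompt template; the model, prompt template, and temperature are exactly the parameters you'll later want to evaluate:

```python
from openai import OpenAI

PROMPT_TEMPLATE = """Answer the question using only the context below.
If the context does not contain the answer, say you don't know.

Context:
{context}

Question: {question}"""

llm = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = llm.chat.completions.create(
    model="gpt-4o-mini",  # model choice is an evaluation parameter
    temperature=0.0,      # so is temperature
    messages=[{
        "role": "user",
        "content": PROMPT_TEMPLATE.format(context=context, question=question),
    }],
)
actual_output = response.choices[0].message.content
```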

I’ve seen two approaches to measuring the effectiveness of the RAG pipeline.
Approach 1: Evaluating Ingestion+Retrieval and Generation separately
In this approach, you evaluate the RAG pipeline’s ingestion + retrieval and generation phases separately, each with its own set of metrics.

For the ingestion+retrieval phase, you need to measure how effectively you’re storing and retrieving relevant context. These are the helpful metrics (a short evaluation sketch follows the list):
- Context Relevance evaluates the relevance of the retrieved context for the given input. A low score usually indicates a problem with how the text is chunked, embedded, and retrieved.
- Context Recall evaluates how many of the relevant documents were successfully retrieved. It focuses on ensuring important results are included. Higher recall means fewer relevant documents were left out.
- Context Precision evaluates whether nodes in the retrieved context that are relevant to the given input are ranked higher than irrelevant ones.
Note that both context recall and context precision require an expected output (the ideal answer to a given input) to compare against.
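Here's a minimal sketch of measuring the retrieval-side metrics with DeepEval (assuming a recent DeepEval version; the metrics use an LLM as a judge, so an API key is needed, and the test case values below are hypothetical):

```python
from deepeval import evaluate
from deepeval.metrics import (
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
)
from deepeval.test_case import LLMTestCase

# Hypothetical example; in practice, build test cases from your RAG pipeline's runs.
test_case = LLMTestCase(
    input="How many vacation days do employees get?",
    actual_output="Employees get 25 vacation days per year.",    # the pipeline's answer
    expected_output="Employees get 25 vacation days per year.",  # ideal answer (needed by recall/precision)
    retrieval_context=[
        "Full-time employees are entitled to 25 vacation days per year.",
    ],
)

evaluate(
    test_cases=[test_case],
    metrics=[ContextualRelevancyMetric(), ContextualRecallMetric(), ContextualPrecisionMetric()],
)
```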
For the generation phase, you need to determine how effective the generator is at generating relevant and grounded output with the supplied context. These are the metrics (again, a sketch follows the list):
- Faithfulness / Groundedness evaluates whether the actual output factually aligns with the retrieved context. If this score is low, it usually points to a problem in your model. Maybe you need to try a better model or fine-tune your own model to get more grounded answers based on the retrieved context.
- Answer relevance evaluates how relevant the actual output is to the given input. If this score is low, it usually points to a problem in your prompt. Maybe you need better prompt templates or better examples in your prompts to get more relevant answers.
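A matching sketch for the generation-side metrics, again assuming DeepEval; note that neither metric needs an expected output, only the input, the retrieved context, and the generated answer:

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# Hypothetical example; no expected_output is required for these two metrics.
test_case = LLMTestCase(
    input="How many vacation days do employees get?",
    actual_output="Employees get 25 vacation days per year.",  # the generator's answer
    retrieval_context=[
        "Full-time employees are entitled to 25 vacation days per year.",
    ],
)

evaluate(test_cases=[test_case], metrics=[FaithfulnessMetric(), AnswerRelevancyMetric()])
```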
This approach allows you to pinpoint issues in the ingestion+retrieval or generation phases. You can determine whether the retriever is failing to retrieve the correct and relevant context or whether the generator is hallucinating despite being provided the right context.
As already mentioned, both context recall and context precision require an expected output. This might not be possible to determine upfront. That’s why the RAG triad emerged as an alternative, reference-free RAG evaluation method.
Approach 2: RAG Triad
In this approach, you evaluate the RAG pipeline as a whole using a subset of the RAG evaluation metrics: answer relevance, faithfulness, and context relevance.

Since context precision and context recall are not part of the RAG triad, this approach allows evaluation without an expected output or reference to compare against.
Evaluation Frameworks
At this point, you might be wondering: How do I implement the RAG triad? Ideally, you use an LLM evaluation framework. There are many out there. These are the ones I’ve used (a sample follows the list):
- DeepEval has been my go-to LLM evaluation framework. It provides the metrics needed for the RAG triad and a guide to it.
- TruLens is another LLM evaluation framework with a RAG triad guide.
- Ragas has an extensive list of metrics, and it can be used for the RAG triad.
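For example, here’s a minimal RAG triad sketch with DeepEval that combines the three metrics; the values are hypothetical, and note that no expected output is required:

```python
from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric,
    ContextualRelevancyMetric,
    FaithfulnessMetric,
)
from deepeval.test_case import LLMTestCase

# The RAG triad: answer relevance, faithfulness, and context relevance.
triad = [AnswerRelevancyMetric(), FaithfulnessMetric(), ContextualRelevancyMetric()]

# Hypothetical test case; there is no expected_output to compare against.
test_case = LLMTestCase(
    input="How many vacation days do employees get?",
    actual_output="Employees get 25 vacation days per year.",  # generated by the pipeline
    retrieval_context=[
        "Full-time employees are entitled to 25 vacation days per year.",
    ],
)

evaluate(test_cases=[test_case], metrics=triad)
```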
Do you know any other good evaluation frameworks? If so, let me know!
Conclusion
In this blog post, I explored a couple of different approaches to evaluating RAG pipelines and introduced the RAG triad metrics. If you’re looking for sample implementations, I have some DeepEval RAG evaluation samples in my GitHub repo.