Grounded Question Answering (QA) is usually the last step of a RAG pipeline: given a question and a set of documents retrieved from the corpus, an LLM must generate an answer. We expect the LLM to cite which document each piece of information is coming from, as depicted below. When no precise answer is in the documents, the LLM should indicate it in its answer. In that case, if some related information is available in the documents, the LLM can add it to the answer to show the corpus is not completely off-topic with respect to the question.
This task is difficult to evaluate because of the wide variety of errors an answer can contain: superfluous information, relevant details missing from the references, incorrect claims that no document answers the question, citation mistakes, and so on. Some attempts have been made to define metrics and automate the evaluation of this task (RAGAS, DeepEval); however, these approaches didn't cover all the failure modes we were interested in.
Most of these approaches rely heavily on the LLM-as-a-Judge method. While this technique can be powerful, it is crucial to first assess the ability of an LLM to accurately evaluate this Grounded QA task with respect to our metrics.
It is tempting to use an LLM to verify the evaluations generated by an evaluator LLM (which was itself assessing an LLM's answers). However, this could quickly lead down a rabbit hole of endless AI-on-AI evaluations. This is why we developed GroUSE (pronounced "graouse"): a unit testing suite designed to evaluate the evaluators.
GroUSE (Grounded QA Unitary Scoring of Evaluators) is a dataset of unit tests used to check whether a Grounded QA evaluator gives the scores we expect. Each test contains:
In our framework, judge LLMs evaluate the quality of a grounded QA answer according to 6 metrics intended to capture all the failure modes of the task:
The GroUSE dataset comprises 144 samples organized into 9 sets. Each set addresses the same question and draws from largely similar references, with slight variations in the answers. These small modifications are tailored to fit a predefined typology of 16 test types, which are designed to assess whether an evaluator correctly penalizes all failure modes and rewards accurate answers across a diverse range of scenarios. The image below displays four samples along with their corresponding test types. For instance, test type 14 assesses whether the faithfulness score is set to 0 when there is a citation mistake.
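To make the idea of a judgment unit test concrete, here is a minimal sketch of how one such test could be represented and checked. The `UnitTest` class, its field names, and the score dictionaries are hypothetical illustrations, not the format used by the GroUSE dataset or package; only the rule for test type 14 (faithfulness must be 0 on a citation mistake) comes from the text above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Hypothetical representation: metric name -> score given by the judge.
Scores = Dict[str, Optional[float]]


@dataclass
class UnitTest:
    test_type: int
    question_id: int
    # One predicate per metric constrained by this test (illustrative only).
    expectations: Dict[str, Callable[[Optional[float]], bool]]

    def passes(self, judge_scores: Scores) -> Dict[str, bool]:
        """Return, for each constrained metric, whether the judge met the expectation."""
        return {
            metric: check(judge_scores.get(metric))
            for metric, check in self.expectations.items()
        }


# Test type 14: a citation mistake must yield a faithfulness score of 0.
citation_mistake = UnitTest(
    test_type=14,
    question_id=1,
    expectations={"faithfulness": lambda score: score == 0},
)

# Two made-up judge outputs for the same sample.
strict_judge = {"faithfulness": 0, "answer_relevancy": 5}
lenient_judge = {"faithfulness": 1, "answer_relevancy": 5}

print(citation_mistake.passes(strict_judge))
print(citation_mistake.passes(lenient_judge))

# 9 question sets, each covering the 16 test types.
n_samples = 9 * 16
print(n_samples)
```

Expressing each expectation as a predicate rather than a single target value leaves room for tests that only constrain a range of acceptable scores.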
GroUSE includes an additional set of tests meant to help users engineer their prompts and obtain the best evaluator possible before checking its performance on the 9 other sets. Using this "train set", we iterated on the prompts, making our best effort to craft the best prompts possible for each of the tested models before measuring how many tests they passed. The "train set" is kept small to imitate the real-world scenario where users have a limited number of samples to optimize their prompts.
The structure of the GroUSE dataset makes it possible to present a model's results in a matrix format, where each row represents the model's performance on a specific test type, and each column corresponds to its performance on a particular question. This format reveals, for example, that GPT-4 struggles with test type 16, which involves an answer containing information that distorts one of the references, leading to a low expected faithfulness but good relevancy and good completeness. Moreover, Llama-3 70B struggles the most with test type 7, a test in which we include an *absurd* fact in the references and mention this fact in the answer. Even though the fact seems incorrect, high scores are expected because it is present in the references. Test type 7 makes it possible to check that the model doesn't use its internal knowledge and relies solely on the references to evaluate the metrics.
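The matrix view described above can be sketched as follows. This is a simplified illustration: each cell holds a single pass/fail boolean, the function names are hypothetical, and the outcomes are made up for the example rather than taken from any model's actual results.

```python
N_TEST_TYPES = 16
N_QUESTIONS = 9


def build_matrix(results):
    """Arrange (test_type, question, passed) triples, both 1-indexed,
    into a matrix with one row per test type and one column per question."""
    matrix = [[None] * N_QUESTIONS for _ in range(N_TEST_TYPES)]
    for test_type, question, passed in results:
        matrix[test_type - 1][question - 1] = passed
    return matrix


def pass_rate_per_test_type(matrix):
    """Fraction of passed tests in each row, ignoring cells with no result."""
    rates = {}
    for test_type, row in enumerate(matrix, start=1):
        known = [cell for cell in row if cell is not None]
        rates[test_type] = sum(known) / len(known) if known else None
    return rates


# Made-up outcomes: a judge that only fails test type 16 on questions 1 to 6.
toy_results = [
    (t, q, not (t == 16 and q <= 6))
    for t in range(1, N_TEST_TYPES + 1)
    for q in range(1, N_QUESTIONS + 1)
]

matrix = build_matrix(toy_results)
rates = pass_rate_per_test_type(matrix)
overall = sum(cell for row in matrix for cell in row) / (N_TEST_TYPES * N_QUESTIONS)

print(f"test type 16 pass rate: {rates[16]:.0%}")
print(f"overall pass rate: {overall:.1%}")
```

Reading the matrix row by row is what surfaces a weakness concentrated on one test type, which an overall pass rate alone would hide.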
For a more compact view, we can also calculate the percentage of tests each model passes for each metric:
The strongest evaluator models are GPT-4 among closed-weights models, with a pass rate of 95%, and Llama-3 70B among open-weights models, with 79%. Human performance on this dataset is 98%. The hardest metric to evaluate is completeness, for LLMs and humans alike.
To demonstrate that the gap between open-weights and closed-weights models can be narrowed, we finetuned a Llama-3 8b model on traces of evaluations by GPT-4. Aiming to develop a model capable of solving the task in a single call, we concatenated the metric-specific responses from GPT-4 into a single output and followed a similar process for the input, resulting in a dataset of 1200 samples. We finetuned Llama-3 8b on 1,000 samples of this dataset and used the rest as a test set. We tracked the model's progress both on GroUSE and through the correlation between GPT-4's grades and the finetuned model's grades on the test set.
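As a rough sketch of the data preparation described above, the snippet below merges per-metric judge outputs into one target and splits 1200 samples into 1,000 for training and 200 for testing. The section layout, the helper names, the placeholder texts, and the shuffling with a fixed seed are all assumptions made for illustration; the actual trace format and split procedure are not specified here.

```python
import random

# Subset of metrics named in this post; used only for illustration.
METRICS = ["answer_relevancy", "completeness", "faithfulness"]


def merge_traces(per_metric_outputs):
    """Concatenate metric-specific judge outputs into one single-call target.
    The '## metric' section layout is a hypothetical choice."""
    return "\n\n".join(
        f"## {metric}\n{per_metric_outputs[metric]}"
        for metric in METRICS
        if metric in per_metric_outputs
    )


def train_test_split(samples, n_train=1000, seed=0):
    """Shuffle a copy of the samples and cut it into train and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder traces standing in for the real evaluations.
example_target = merge_traces(
    {
        "faithfulness": "The citations match the references. Score: 1",
        "answer_relevancy": "The answer addresses the question. Score: 5",
    }
)
print(example_target)

samples = [{"id": i} for i in range(1200)]
train, test = train_test_split(samples)
print(len(train), len(test))
```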
Finetuning significantly enhances the evaluation capabilities of Llama-3 8b, as evidenced by the substantial improvement in its test pass rate, which goes from 40% to 83%. Similar progress can be seen in the correlation measures; however, it is worth noting that the finetuned model reaches correlation levels similar to those of the zero-shot Llama-3 8b evaluating one metric per prompt. Although this approach demonstrated significant improvements, it would be beneficial to explore the effects of finetuning larger models, which could potentially yield even better performance.
Our results reveal a discrepancy between GroUSE pass rates and correlation with GPT-4's grades. While Prometheus 2 7b and the finetuned Llama-3 8b show similar correlations with GPT-4 on answer relevancy, their GroUSE pass rates differ significantly, with the finetuned Llama-3 8b outperforming Prometheus 2 7b. Confusion matrices reveal that Prometheus 2 has better overall agreement with GPT-4 but struggles with extreme cases (1, 5 and NaN cases), while the finetuned Llama-3 excels in extreme cases but lacks correlation in intermediate ones.
This finding suggests that a high correlation with GPT-4's judgments does not necessarily equate to a high unit test pass rate. A judge model can share the same relative preferences as GPT-4 (indicated by strong rank correlation) but still lack the same calibration on precise reference cases (very good answers, subtle mistakes, etc.), resulting in poor performance on judgment unit tests.
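The toy example below illustrates how this can happen. The grades are invented for the illustration, and the pass rule (the judge must give exactly the reference grade) is a hypothetical simplification of a unit test. The judge compresses its grades toward the middle of the 1 to 5 scale, so it ranks answers exactly like the reference while missing every extreme case.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented grades: the judge maps 1 -> 2, 3 -> 3, 5 -> 4.
reference_grades = [1, 1, 3, 5, 5, 5, 3, 1]
judge_grades = [2, 2, 3, 4, 4, 4, 3, 2]

rho = spearman(reference_grades, judge_grades)
# Hypothetical pass rule: the judge must match the reference grade exactly.
pass_rate = sum(j == r for j, r in zip(judge_grades, reference_grades)) / len(reference_grades)

print(f"Spearman correlation: {rho:.3f}")
print(f"unit test pass rate: {pass_rate:.0%}")
```

Because the mapping from reference grades to judge grades is strictly increasing, the rank correlation is perfect, yet only the two intermediate samples pass: a judge can agree on ordering and still be miscalibrated on the cases the unit tests pin down.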
To conclude briefly:
If you want to evaluate your RAG pipeline with our GPT-4 prompts, or even meta-evaluate your RAG evaluator on GroUSE, a Python package is available at github.com/illuin-tech/grouse!
@misc{muller2024grouse,
title={GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering},
author={Sacha Muller and António Loison and Bilel Omrani and Gautier Viaud},
year={2024},
eprint={2409.06595},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.06595},
}