Information Retrieval Metrics 101: Recall and Precision
What is information retrieval?
Information retrieval refers to the process of finding a subset of documents from a large corpus that is relevant to a particular query. Retrieval systems are a core piece of Retrieval-Augmented Generation (RAG) pipelines, which have exploded in popularity due to their effectiveness in grounding Large Language Models (LLMs) on proprietary data. To be specific, information retrieval happens in the retriever stage of the pipeline. If we have a lousy retriever, the generated answer will be lousy as well. Garbage in, garbage out.
In other words, we need to keep track of our retriever's performance so that we can have confidence that the documents it fetches provide enough context for the LLM to work with to generate a satisfactory answer. To quantify this, we begin with the two most fundamental metrics: recall and precision.
How do we know if retrieval is good?
Let's start with an example. Say our retriever needs to fetch 5 documents (k = 5) from our knowledge base for a particular query. When it does so, we get 3 relevant documents and 2 irrelevant documents. Let's assume that, for this query, the knowledge base has 8 relevant documents in total. In this case:

- Recall = 3 / 8 = 0.375, since we retrieved 3 of the 8 relevant documents in the knowledge base.
- Precision = 3 / 5 = 0.6, since 3 of the 5 documents we retrieved are relevant.
For a more formal definition, we define recall and precision for a retrieved set of size k as such:
- Recall@k - The ratio between the number of relevant documents in the retrieved set and the total number of relevant documents in our knowledge base. It seeks to answer the question "Out of all the documents relevant to answering this query, how many did we retrieve?"
- Precision@k - The ratio between the number of relevant documents in the retrieved set and k. It seeks to answer the question "Out of all the retrieved documents, how many are relevant to the query?"
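To sanity-check the example above (k = 5, with 3 relevant documents retrieved out of 8 relevant documents in the knowledge base), here's the arithmetic as a quick sketch:

```python
# Numbers from the running example: 3 of the 5 retrieved documents are
# relevant, and the knowledge base holds 8 relevant documents in total.
retrieved_relevant = 3
k = 5
total_relevant = 8

precision_at_5 = retrieved_relevant / k            # 3 / 5 = 0.6
recall_at_5 = retrieved_relevant / total_relevant  # 3 / 8 = 0.375

print(precision_at_5, recall_at_5)
```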
Although looking at either metric by itself provides information about the performance of our retriever, we can gain better insight by looking at the gap between these metrics.
What happens if we have high precision, but low recall?
In this failure mode, the retrieved set is largely composed of relevant documents, but there are a lot more relevant documents in the knowledge base that we failed to retrieve. Our retriever is too selective, pulling in little junk, but also barely pulling in any relevant data.
The consequence of this is that we may not provide enough relevant information to the LLM to generate a sufficient answer. This can be exacerbated by queries that require multiple relevant chunks to formulate a satisfactory answer (e.g. prompts that require reasoning, or that ask for a comparison between multiple concepts). In order to fix this, we can increase k to provide more room for relevant documents to fit, though this also provides an opportunity for irrelevant documents to slip in (i.e. a decrease in precision).
What happens if we have high recall, but low precision?
In this failure mode, we retrieve a large number of documents from the knowledge base that are relevant to the query, but we also retrieve a large number of irrelevant ones. Our retriever casts a wide net, enabling it to capture a lot of key information, but it also pulls in a lot of junk.
The consequence of this is that our context window will be polluted with irrelevant data. Since LLM providers charge per token processed, we will end up spending money processing data that we don't actually need for the answer. In order to fix this, we can decrease k to leave less room for irrelevant documents to slip in, though this also restricts the number of relevant documents we can pull in (i.e. decreases recall).
You may notice that optimizing for recall often comes at the cost of precision, and vice versa.
Initially, it's better to optimize for recall instead of precision. Although you may end up with a lot of noise in your context window, a high recall will increase the likelihood of generating a sufficient answer since the LLM has more relevant data to work with.
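To make the trade-off concrete, here's a small sketch that sweeps the cutoff k over a made-up ranked list and relevance set (both hypothetical). Recall can only stay flat or rise as k grows, while precision tends to fall as irrelevant documents slip in:

```python
# Hypothetical ranked retrieval output and ground-truth relevant set.
ranked = ["d1", "d7", "d3", "d9", "d2", "d8", "d5", "d4"]
relevant = {"d1", "d3", "d5", "d8"}

results = []
for k in range(1, len(ranked) + 1):
    # Count relevant documents within the top-k cutoff.
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    results.append((k, hits / k, hits / len(relevant)))
    print(f"k={k}: precision={hits / k:.2f}, recall={hits / len(relevant):.2f}")
```

Scanning the output, recall climbs from 0.25 at k = 1 to 1.0 at k = 7, while precision drifts down from 1.0 toward 0.5.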
What does this look like in code?
With the formal definition set, let's take a look at how this is implemented in code. It's relatively straightforward; we'll be using Python.
```python
def precision_at_k(retrieved: list[str], ground_truth: dict[str, int], k: int) -> float:
    """
    Compute Precision@k.

    Args:
        retrieved: Ordered list of document IDs (rank 1 first).
        ground_truth: Dict mapping relevant doc IDs to relevance grades (grade > 0 = relevant).
        k: Cutoff position.

    Returns:
        Precision@k as a float between 0.0 and 1.0.
    """
    if len(ground_truth) == 0:
        return 0.0

    relevant_docs = sum(
        1 if ground_truth.get(doc, 0) > 0 else 0 for doc in retrieved[:k]
    )
    return relevant_docs / k
```
```python
def recall_at_k(retrieved: list[str], ground_truth: dict[str, int], k: int) -> float:
    """
    Compute Recall@k.

    Args:
        retrieved: Ordered list of document IDs (rank 1 first).
        ground_truth: Dict mapping relevant doc IDs to relevance grades (grade > 0 = relevant).
        k: Cutoff position.

    Returns:
        Recall@k as a float between 0.0 and 1.0.
    """
    if len(ground_truth) == 0:
        return 0.0

    relevant_docs = sum(
        1 if ground_truth.get(doc, 0) > 0 else 0 for doc in retrieved[:k]
    )
    return relevant_docs / len(ground_truth)
```
This is what the input looks like:
```python
EVALS_DATASET = [
    {
        "query": "What is the GIL?",
        "retrieved": ["chunk_01", "chunk_04", "chunk_03", "chunk_02", "chunk_12"],
        "ground_truth": {"chunk_12": 3},
    },
    {
        "query": "What is a decorator?",
        "retrieved": ["chunk_07", "chunk_05", "chunk_11", "chunk_02", "chunk_08"],
        "ground_truth": {"chunk_11": 3},
    },
    {
        "query": "How do I sort an object by the value?",
        "retrieved": ["chunk_10", "chunk_07", "chunk_04", "chunk_01", "chunk_05"],
        "ground_truth": {"chunk_10": 3, "chunk_04": 2},
    },
]
```
It's a list of objects with the following items:
- `query` - The query we seek to answer
- `retrieved` - A list of the documents our retriever fetched
- `ground_truth` - An object containing the documents that are actually relevant to answering the query (typically manually annotated). For now, don't mind the integer value associated with each chunk. We'll revisit that later!
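Putting everything together, we can score each query in the dataset. The snippet below repeats the two functions in compact form so it runs standalone, then evaluates every row at k = 5:

```python
def precision_at_k(retrieved: list[str], ground_truth: dict[str, int], k: int) -> float:
    # Same logic as above, condensed so this snippet runs standalone.
    if not ground_truth:
        return 0.0
    hits = sum(1 for doc in retrieved[:k] if ground_truth.get(doc, 0) > 0)
    return hits / k


def recall_at_k(retrieved: list[str], ground_truth: dict[str, int], k: int) -> float:
    if not ground_truth:
        return 0.0
    hits = sum(1 for doc in retrieved[:k] if ground_truth.get(doc, 0) > 0)
    return hits / len(ground_truth)


EVALS_DATASET = [
    {
        "query": "What is the GIL?",
        "retrieved": ["chunk_01", "chunk_04", "chunk_03", "chunk_02", "chunk_12"],
        "ground_truth": {"chunk_12": 3},
    },
    {
        "query": "What is a decorator?",
        "retrieved": ["chunk_07", "chunk_05", "chunk_11", "chunk_02", "chunk_08"],
        "ground_truth": {"chunk_11": 3},
    },
    {
        "query": "How do I sort an object by the value?",
        "retrieved": ["chunk_10", "chunk_07", "chunk_04", "chunk_01", "chunk_05"],
        "ground_truth": {"chunk_10": 3, "chunk_04": 2},
    },
]

K = 5
scores = []
for row in EVALS_DATASET:
    p = precision_at_k(row["retrieved"], row["ground_truth"], K)
    r = recall_at_k(row["retrieved"], row["ground_truth"], K)
    scores.append((p, r))
    print(f"{row['query']!r}: precision@{K}={p:.2f}, recall@{K}={r:.2f}")
```

Every query reaches full recall here, but precision stays low: with k fixed at 5 and only one or two relevant chunks per query, most of the retrieved set is necessarily noise.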
What do these metrics not tell us?
Although recall and precision are fundamental metrics of information retrieval, they are order-unaware. Say we have the following setup:
- System A retrieves 2 relevant documents, placing them first in the retrieved set.
- System B also retrieves 2 relevant documents, but they're scattered in the retrieved set.
Computing precision and recall for both of these systems yields the same results, but retriever A is much better than retriever B. It's important to put relevant context first (or last), since LLMs, especially those with large context windows, tend to perform worse when relevant context is buried in the middle of the context window (the "lost in the middle" effect).
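A quick sketch of this blind spot, using two hypothetical rankings: both systems retrieve the same 2 relevant documents at k = 5, so their precision and recall are identical even though System A ranks the relevant documents first:

```python
# Two hypothetical systems that each retrieve the same 2 relevant
# documents at k = 5; only the ranks of those documents differ.
relevant = {"doc_a", "doc_b"}
systems = {
    "A": ["doc_a", "doc_b", "x1", "x2", "x3"],  # relevant docs ranked first
    "B": ["x1", "x2", "doc_a", "x3", "doc_b"],  # relevant docs scattered
}

stats = {}
for name, ranked in systems.items():
    hits = sum(1 for doc in ranked if doc in relevant)
    stats[name] = (hits / len(ranked), hits / len(relevant))  # (precision@5, recall)
    print(f"System {name}: precision@5={stats[name][0]:.1f}, recall={stats[name][1]:.1f}")
```

Both systems score precision@5 = 0.4 and recall = 1.0, so neither metric can distinguish A's better ordering.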
This is why we employ other metrics that are order-aware, such as:
- Mean Reciprocal Rank (MRR)
- Mean Average Precision (MAP)
- Normalized Discounted Cumulative Gain (NDCG)
If you want to learn what these are, stay tuned! I'll be writing about them in upcoming posts.
Until then, happy coding🚀