
Information Retrieval 103: A Field Guide to Diagnostic Patterns
What information do these metrics give us?
We previously touched on some common information retrieval metrics, including both order-blind and order-aware metrics. We understand what they are, but how do they help us identify if our retrieval system is functioning properly?
Each metric exposes a different aspect of your retrieval system's behavior. By examining our system's performance across these metrics, we can narrow down which part of it needs to be revisited. However, to get a sense of how well or poorly our system is currently doing, we first need to determine what its performance would look like under ideal circumstances. This is where oracle baselines come in.
What's an oracle baseline?
An oracle baseline is the value a given metric would take if your system performed ideally on a specific query. In other words, all of our optimizations can only take the system up to this ceiling. Why do we care? Because it gives us a meaningful reference point to compare our system's actual performance against.
For example, say you had a Precision@10 of 0.1. Only 1 out of the 10 documents in the set is relevant. Sounds horrible! However, when you look at the query, it turns out that it only has 1 relevant document in the knowledge base. In other words, 0.1 is the highest score you can get for this query. This is your oracle baseline for Precision@10 for this query.
To give another example, say that you have a query that has 20 relevant documents in the knowledge base. However, your retrieved set can only hold 10 documents. Assuming all the documents you retrieve for this query are relevant, you can still only retrieve 10 of them. As a result, your maximum Recall@10 for this particular query, or your oracle baseline, is only 0.5 or 50%.
In this case, if the Recall@10 that you arrive at is actually 0.45, that may look average when interpreted in absolute terms, but it's actually great performance, as it's 90% of the oracle!
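The examples above can be sketched in a few lines of Python; the function names here are illustrative, not from any particular library. The only inputs needed are how many relevant documents exist for the query and how many slots the retrieved set has:

```python
def oracle_precision_at_k(num_relevant: int, k: int) -> float:
    """Best possible Precision@k: a perfect retriever can't put more
    relevant documents in the set than exist in the knowledge base."""
    return min(num_relevant, k) / k

def oracle_recall_at_k(num_relevant: int, k: int) -> float:
    """Best possible Recall@k: you can't retrieve more relevant
    documents than fit in a set of size k."""
    if num_relevant == 0:
        return 0.0
    return min(num_relevant, k) / num_relevant

# 20 relevant documents, a retrieved set of size 10 -> oracle Recall@10 = 0.5.
oracle = oracle_recall_at_k(num_relevant=20, k=10)
observed = 0.45
print(f"{observed / oracle:.0%} of oracle")  # 90% of oracle
```

Normalizing an observed score by its oracle like this is what lets us call 0.45 "great" rather than "average".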
In short, oracle baselines allow us to draw meaningful comparisons from the values of our retrieval metrics.
What are diagnostic patterns?
Besides comparing each metric to its oracle baseline, you can learn about your system's behavior by looking at the gap between two metrics. When one metric is high but another is egregiously low, that tells you more about what's wrong with your system than looking at either metric in isolation. Many such diagnostic patterns exist, each with its own possible cause and fix; the examples below demonstrate a few of them.
High Recall, Low Precision
In this diagnostic pattern, our system exhibits a high recall but low precision. This means that we're able to pull in a large number of the relevant chunks needed to answer a query, but our retrieved set contains a lot of irrelevant chunks as well.
This could indicate that your retrieved set size k is too large. To fix this, reduce k.
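A minimal sketch of how this pattern shows up in the numbers (the document IDs and helper names here are made up for illustration):

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant) if relevant else 0.0

# Both relevant documents are retrieved (Recall@10 = 1.0), but they are
# diluted by 8 irrelevant ones (Precision@10 = 0.2): k is likely too large.
retrieved = ["d1", "x1", "x2", "d2", "x3", "x4", "x5", "x6", "x7", "x8"]
relevant = {"d1", "d2"}
print(recall_at_k(retrieved, relevant, 10), precision_at_k(retrieved, relevant, 10))
# 1.0 0.2
```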
High Precision, Low Recall
In this case, our system exhibits a high precision, but low recall. This informs us that our system's retrieved set is mainly filled with relevant documents, but only a small fraction of the relevant documents needed to answer the query were retrieved from the knowledge base.
This is typically attributed to having too small a retrieved set size. To address this, increase k.
You may notice that the two diagnostic patterns we just discussed have contradictory fixes (i.e. increase the retrieved set size vs. decrease it), so you need to find the sweet spot for your use case. The catch is that different queries require different numbers of relevant documents to answer sufficiently.
For example, a question like "What is an LLM?" may only require 1 relevant document to answer, while a question like "Elaborate on all the recent developments in LLM architecture" may require a large number of relevant documents, each of which discusses a specific recent development.
Although you'd typically set a flat retrieved set size regardless of the query to keep things simple, more sophisticated approaches like adaptive-k take this variability into account.
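One simple way to sketch an adaptive-k policy is to keep only the documents whose retrieval score clears a cutoff, capped at some maximum. The threshold and cap below are illustrative knobs, not canonical values:

```python
def adaptive_k(scores: list[float], threshold: float = 0.5, k_max: int = 20) -> int:
    """Choose k per query: count the (descending) retrieval scores that
    clear the threshold, capped at k_max; always keep at least one doc."""
    k = sum(1 for s in scores[:k_max] if s >= threshold)
    return max(k, 1)

print(adaptive_k([0.91, 0.88, 0.32, 0.10]))  # 2: a narrow query keeps few docs
print(adaptive_k([0.95, 0.94, 0.93, 0.90]))  # 4: a broad query keeps more
```

This assumes the retriever's scores are roughly comparable across queries, which is itself something a reranker or score calibration step may need to ensure.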
High Recall, Low MRR
Here, our retrieval system retrieves most of the relevant documents needed to answer a query, but they're buried beneath irrelevant documents in the retrieved set.
One possible cause for this is poor calibration of your reranker or scorer, since these are the components that determine the final rank each document takes in the retrieved set. To fix this, re-calibrate them.
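The reciprocal rank that MRR averages over queries can be sketched as follows; note how, with perfect recall but the relevant documents pushed to the bottom of the set, the score collapses (document IDs are made up for illustration):

```python
def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant document; 0 if none appears."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

# All three relevant documents are retrieved (Recall@10 = 1.0), but the
# first one only shows up at rank 8, so the reciprocal rank is 1/8.
retrieved = ["x1", "x2", "x3", "x4", "x5", "x6", "x7", "d1", "d2", "d3"]
print(reciprocal_rank(retrieved, {"d1", "d2", "d3"}))  # 0.125
```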
High MRR, Low MAP
In this case, our retriever is able to rank a relevant document highly in the retrieved set (i.e. top-of-funnel performance is good), but the remaining relevant documents are buried.
One possible root cause is missing coverage: our retriever simply isn't retrieving enough of the relevant documents. There are multiple fixes for this, such as raising k (as in the second pattern), improving our retriever, or expanding our corpus.
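The gap is easy to see in code: average precision (the per-query quantity that MAP averages) rewards the rank of every relevant document, not just the first. A sketch with made-up document IDs:

```python
def average_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Average of Precision@rank at each rank where a relevant document
    appears, divided by the total number of relevant documents."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

# The first relevant document sits at rank 1 (reciprocal rank = 1.0),
# but the other two are buried at ranks 9 and 10, dragging AP down.
retrieved = ["d1", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "d2", "d3"]
ap = average_precision(retrieved, {"d1", "d2", "d3"})
print(round(ap, 3))  # 0.507
```

Here the reciprocal rank is a perfect 1.0 while the average precision is only about 0.5: exactly the high-MRR, low-MAP signature described above.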
What have we learned?
In this article, we discussed how to use information retrieval metrics. Each metric is a lens that lets us observe a different aspect of our system, but the gap between two metrics often gives us more information than any metric in isolation. Through the use of oracle baselines and diagnostic patterns, we're able to diagnose the performance of our retrieval system and identify actionable steps to improve it.
No metric is an island
Happy coding🚀