
Information Retrieval Metrics 102: Order-Aware Metrics

Retrieval Augmented Generation · Information Retrieval · Large Language Models

What are order-aware metrics?

In a previous post on retrieval metrics, we discussed how recall and precision measure how many relevant documents your retrieved set contains. Now, imagine that we have retrievers A and B, each returning five documents for the same query, two of which (D1 and D2) are relevant:

  • Retriever A: [X, X, D1, D2, X]
  • Retriever B: [D1, D2, X, X, X]

Both of these systems will score the same in terms of recall and precision, yet the retrieved sets they produce aren't equivalent! Retriever A places the relevant documents in the middle of the retrieved set, while retriever B places them at the start. What's wrong with this?

The issue is that LLMs suffer from a lost-in-the-middle problem: performance degrades when relevant documents sit in the middle of the context window. For better performance, relevant documents should be placed at the end or, even better, at the beginning of the context. In other words, we need to be able to call out our retriever if it doesn't rank relevant documents highly.

We need to look at other metrics that give us insight into how well the retriever ranks relevant documents. These order-aware metrics are:

  1. Mean Reciprocal Rank (MRR)
  2. Mean Average Precision (MAP)
  3. Normalized Discounted Cumulative Gain (NDCG)

What are the standard order-aware metrics?

What is Mean Reciprocal Rank (MRR)?

Prior to defining MRR, we need to define what reciprocal rank is. Reciprocal rank measures how highly the first relevant document ranks in your retrieved set:

  • A high reciprocal rank indicates that your first relevant document appears near the top of the retrieved set
  • A low reciprocal rank indicates that your first relevant document appears far down the retrieved set

To compute the reciprocal rank of a specific query, we take the reciprocal of the rank of the first relevant document that appears in the retrieved set:


RR = 1 / r

Reciprocal Rank Definition

where r is the rank of the first relevant document in the retrieved set.

Applying this to our earlier example, we see that Retriever A has an RR of 1/3 (first relevant document at rank 3), while Retriever B has an RR of 1 (first relevant document at rank 1). From this, we can clearly see that Retriever B is the better system.

To arrive at the Mean Reciprocal Rank (MRR), we take the average of the recorded reciprocal rank values across multiple queries.


MRR = (1 / |Q|) · Σ_{q=1}^{|Q|} RR_q

Mean Reciprocal Rank Definition

where Q is the set of evaluation queries and RR_q is the reciprocal rank of query q.


Mean Reciprocal Rank measures how well the retriever ranks the first relevant document of our retrieved set. Here's what it looks like in code:

mrr.py

def reciprocal_rank(retrieved: list[str], ground_truth: dict[str, int]) -> float:
    """
    Compute the reciprocal rank for a single query.

    Args:
        retrieved: Ordered list of document IDs (rank 1 first).
        ground_truth: Dict mapping relevant doc IDs to relevance grades.

    Returns:
        1/rank of the first relevant document, or 0.0 if none found.
    """
    for i, doc in enumerate(retrieved):
        if ground_truth.get(doc, 0) > 0:
            return 1 / (i + 1)

    return 0.0


def mrr(queries: list[dict]) -> float:
    """
    Compute Mean Reciprocal Rank across multiple queries.

    Args:
        queries: List of dicts, each with keys:
            - "retrieved": list[str] - ordered doc IDs
            - "ground_truth": dict[str, int] - relevant doc IDs with grades

    Returns:
        MRR as a float, or 0.0 if there are no queries.
    """
    if not queries:
        return 0.0

    rrs = [reciprocal_rank(q["retrieved"], q["ground_truth"]) for q in queries]
    return sum(rrs) / len(rrs)
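As a quick sanity check, here's the earlier A-versus-B example run through a compact, self-contained version of the same logic. The document IDs are hypothetical placeholders:

```python
def rr(retrieved: list[str], relevant: set[str]) -> float:
    # 1/rank of the first relevant document; 0.0 if none was retrieved.
    return next((1 / (i + 1) for i, d in enumerate(retrieved) if d in relevant), 0.0)

relevant = {"D1", "D2"}
retriever_a = ["x1", "x2", "D1", "D2", "x3"]  # relevant docs buried in the middle
retriever_b = ["D1", "D2", "x1", "x2", "x3"]  # relevant docs ranked first

print(rr(retriever_a, relevant))  # 0.333... (first relevant doc at rank 3)
print(rr(retriever_b, relevant))  # 1.0 (first relevant doc at rank 1)
```

Both retrievers contain the same relevant documents, but only the reciprocal rank exposes the difference in where they place them.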

What is Mean Average Precision (MAP)?

Average Precision measures how highly all of your retrieved relevant documents rank in the retrieved set:

  • A high average precision indicates that your retriever is able to rank relevant documents highly
  • A low average precision indicates that your retriever buries relevant documents under irrelevant ones

Computing the average precision is a bit more complicated. We need to:

  1. Compute Precision@k for every rank k, from 1 to K (the size of the retrieved set), at which the document is relevant
  2. Take the average of these precision values. In this case, the denominator is the total number of relevant documents in the retrieved set

AP = (1 / R) · Σ_{k=1}^{K} Precision@k · rel(k)

Average Precision Definition

where R is the total number of relevant documents in the retrieved set, and rel(k) is 1 if the document at rank k is relevant and 0 otherwise.

Note

There is an alternative definition of average precision where you divide by the total number of relevant documents in the knowledge base instead of the retrieved set. This will penalize the retriever for failing to retrieve relevant documents.

For this article, however, we'll be using the definition provided above.
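To make this note concrete, here's a small sketch (with hypothetical document IDs) contrasting the two denominators when a relevant document is never retrieved:

```python
def avg_precision(retrieved: list[str], relevant: set[str], corpus_level: bool = False) -> float:
    hits, precisions = 0, []
    for i, doc in enumerate(retrieved):
        if doc in relevant:
            hits += 1
            precisions.append(hits / (i + 1))  # Precision@(i + 1)
    # Retrieved-set denominator (this article) vs. knowledge-base denominator.
    denom = len(relevant) if corpus_level else len(precisions)
    return sum(precisions) / denom if denom else 0.0

relevant = {"D1", "D2", "D3"}          # D3 exists in the knowledge base but is never retrieved
retrieved = ["D1", "x1", "D2", "x2"]

print(avg_precision(retrieved, relevant))                     # (1/1 + 2/3) / 2 ≈ 0.833
print(avg_precision(retrieved, relevant, corpus_level=True))  # (1/1 + 2/3) / 3 ≈ 0.556
```

The corpus-level variant scores lower precisely because it charges the retriever for the relevant document it missed.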

Applying this to our example earlier, we get the average precision for retrievers A and B:

  • Retriever A: AP = (1/3 + 2/4) / 2 ≈ 0.42
  • Retriever B: AP = (1/1 + 2/2) / 2 = 1.0

Notice that, once again, retriever B is considered better as it ranks all relevant documents highly in the retrieved set.

To compute the Mean Average Precision, just take the average of the average precisions (confusing, I know) measured across multiple queries.


MAP = (1 / |Q|) · Σ_{q=1}^{|Q|} AP_q

Mean Average Precision Formula

where AP_q is the average precision of query q.

Here's what it looks like in code:

map.py

def average_precision(retrieved: list[str], ground_truth: dict[str, int]) -> float:
    """
    Compute Average Precision (AP) for a single query.

    Args:
        retrieved: Ordered list of document IDs (rank 1 first).
        ground_truth: Dict mapping relevant doc IDs to relevance grades.

    Returns:
        AP as a float. Per the definition above, the denominator is the
        number of relevant documents in the retrieved set.
    """
    hits = 0
    precisions = []

    for i, doc in enumerate(retrieved):
        if ground_truth.get(doc, 0) > 0:
            hits += 1
            precisions.append(hits / (i + 1))  # Precision@k at rank k = i + 1

    if not precisions:
        return 0.0

    return sum(precisions) / len(precisions)


def mean_average_precision(queries: list[dict]) -> float:
    """
    Compute Mean Average Precision (MAP) across multiple queries.

    Args:
        queries: List of dicts, each with keys:
            - "retrieved": list[str] - ordered doc IDs
            - "ground_truth": dict[str, int] - relevant doc IDs with grades

    Returns:
        MAP as a float, or 0.0 if there are no queries.
    """
    if not queries:
        return 0.0

    aps = [average_precision(q["retrieved"], q["ground_truth"]) for q in queries]
    return sum(aps) / len(aps)

What is Normalized Discounted Cumulative Gain (NDCG)?

Imagine you have the following setup:

  • Search query - "How do I integrate Better Auth into Next.js?"
  • Relevant documents retrieved:
    1. A blog post introducing Better Auth
    2. Technical documentation discussing how to use Better Auth with Next.js
    3. A StackOverflow thread discussing a bug when integrating Better Auth with Next.js and how to resolve it

Although all of these documents are relevant, they aren't equally so. The technical documentation is the most relevant, relatively speaking, since it directly answers the query. The other two documents discuss the same concepts but don't answer the query as directly.

We quantify this directness by assigning a grade. For example, say we have a grading system from 0 to 3, where 0 is irrelevant and 3 is extremely relevant. Our documents above would then be graded like so:

  1. Document A - 1. Mentions what Better Auth is, but provides no direct answer to the query.
  2. Document B - 3. Directly answers the query by providing documentation on the integration.
  3. Document C - 2. Covers one aspect of integrating Better Auth with Next.js, but answers the query less directly than Document B.
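In code, these graded judgments are typically stored as a mapping from document ID to grade. The IDs below are hypothetical:

```python
# Hypothetical ground-truth grades for the three documents above
ground_truth = {
    "better-auth-intro-blog": 1,         # Document A: background only
    "better-auth-nextjs-docs": 3,        # Document B: directly answers the query
    "better-auth-nextjs-bug-thread": 2,  # Document C: a narrower aspect of the integration
}

# The best document for this query is the one with the highest grade.
best_doc = max(ground_truth, key=ground_truth.get)
print(best_doc)  # better-auth-nextjs-docs
```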

How does this relate to retrieval? Ideally, we want our retriever to rank the documents with the highest grades first as these are the most important in answering a particular query. To quantify this performance, we introduce Discounted Cumulative Gain (DCG).

DCG measures how organized your relevant documents are in the retrieved set:

  • High DCG indicates that your relevant documents are ranked highly and in order
  • Low DCG indicates that your relevant documents are buried and/or disorganized (i.e. less relevant documents rank higher than more relevant ones)

To compute DCG for a particular query, we:

  1. Take each document's graded relevance and divide it by the natural log of (that document's rank + 1)
  2. Take the sum of these values for each document


DCG = Σ_{i=1}^{K} rel_i / ln(i + 1)

Discounted Cumulative Gain Formula

where rel_i is the graded relevance of the document at rank i, and K is the size of the retrieved set.
Note

Why do we divide by the logarithm of the position instead of just the position itself?


We do so to soften the penalty for lower-ranked documents. Intuitively, the difference between placing a document at rank 1 versus rank 2 should matter more than placing it at rank 11 versus rank 12, and the logarithmic denominator reflects this intuition.
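A quick numeric check of this intuition: with a 1/ln(rank + 1) discount, swapping ranks 1 and 2 costs far more than swapping ranks 11 and 12:

```python
from math import log

# Drop in discount weight between adjacent ranks, early vs. late in the list
early_gap = 1 / log(1 + 1) - 1 / log(2 + 1)   # between ranks 1 and 2
late_gap = 1 / log(11 + 1) - 1 / log(12 + 1)  # between ranks 11 and 12

print(round(early_gap, 3))  # 0.532
print(round(late_gap, 3))   # 0.013
```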

Say that this is how our retriever returns the relevant documents:

  1. Document A (grade 1)
  2. Document B (grade 3)
  3. Document C (grade 2)

Our DCG for this system would be:

DCG = 1/ln(2) + 3/ln(3) + 2/ln(4) ≈ 1.44 + 2.73 + 1.44 ≈ 5.62

How do we know if this is a good value for DCG? Let's think about it. If your retriever were performing ideally, it would rank the documents with the higher relevance grades higher in the retrieved set:

  1. Document B (grade 3)
  2. Document C (grade 2)
  3. Document A (grade 1)

If this were the case, the ideal DCG would be:

IDCG = 3/ln(2) + 2/ln(3) + 1/ln(4) ≈ 4.33 + 1.82 + 0.72 ≈ 6.87

To identify how close we are to this ideal, we define the Normalized Discounted Cumulative Gain, or NDCG, which is just the ratio of DCG to IDCG:

NDCG = DCG / IDCG

Normalized Discounted Cumulative Gain Formula

For this retriever and on this particular query, we have the following NDCG:

NDCG = 5.62 / 6.87 ≈ 0.82

In other words, our retriever is currently performing at roughly 82% of the ideal.
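This computation is easy to check in a few lines of Python. As one illustration, suppose the retriever returned the documents in the order A, B, C; using the linear-gain formula above (the log base cancels in the DCG/IDCG ratio, so ln and log2 give the same NDCG):

```python
from math import log

def dcg(grades: list[int]) -> float:
    # Linear-gain DCG over an ordered list of relevance grades.
    return sum(g / log(i + 2) for i, g in enumerate(grades))

retrieved_grades = [1, 3, 2]                           # Documents A, B, C as retrieved
ideal_grades = sorted(retrieved_grades, reverse=True)  # [3, 2, 1]

ndcg = dcg(retrieved_grades) / dcg(ideal_grades)
print(round(ndcg, 2))  # 0.82
```

Note that the exact value depends on the retrieved order and on which gain variant you use; the exponential-gain variant implemented later in this post would give a somewhat lower number for the same ordering.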

NDCG also allows us to compare DCG performance across queries: different queries have different IDCGs, since each query has its own ground-truth relevant documents. Normalizing DCG by IDCG puts every query on the same 0-to-1 scale, allowing direct comparison across multiple queries!

Here's what the code looks like. Note that it implements the exponential-gain variant of DCG, (2^rel - 1) / log2(rank + 1), which is common in practice and rewards highly graded documents more aggressively than the linear formula above:

ndcg.py

from math import log


def dcg_at_k(retrieved: list[str], ground_truth: dict[str, int], k: int) -> float:
    """
    Compute DCG@k using the exponential gain variant: (2^rel - 1) / log2(rank + 1).

    Args:
        retrieved: Ordered list of document IDs (rank 1 first).
        ground_truth: Dict mapping doc IDs to relevance grades.
            Documents not in ground_truth have grade 0.
        k: Cutoff position.

    Returns:
        DCG@k as a float.
    """
    dcg = 0.0

    for i, doc in enumerate(retrieved[:k]):
        relevance = ground_truth.get(doc, 0)
        if relevance <= 0:
            continue

        # The document sits at rank i + 1, so the discount is log2(i + 2).
        dcg += (2**relevance - 1) / log(i + 2, 2)

    return dcg


def ndcg_at_k(retrieved: list[str], ground_truth: dict[str, int], k: int) -> float:
    """
    Compute NDCG@k = DCG@k / IDCG@k.

    Args:
        retrieved: Ordered list of document IDs (rank 1 first).
        ground_truth: Dict mapping doc IDs to relevance grades.
        k: Cutoff position.

    Returns:
        NDCG@k as a float between 0.0 and 1.0. Returns 0.0 if IDCG is 0.
    """
    dcg = dcg_at_k(retrieved, ground_truth, k)

    # The ideal ordering ranks every relevant document by grade, descending.
    ideal_retrieved = sorted(ground_truth, key=ground_truth.get, reverse=True)
    idcg = dcg_at_k(ideal_retrieved, ground_truth, k)

    if idcg == 0:
        return 0.0

    return dcg / idcg

What did we learn?

Precision and recall allow us to quantify how many relevant documents we're able to retrieve, but they fail to capture how well the retriever ranks those documents. We care about ordering since an LLM tends to perform better when relevant documents are placed at the start of the context window. Thus, we need order-aware metrics.

The common metrics are:

  • Mean Reciprocal Rank (MRR) - Measures how high the first retrieved relevant document ranks
  • Mean Average Precision (MAP) - Measures how high all retrieved relevant documents rank
  • Normalized Discounted Cumulative Gain (NDCG) - Measures how ordered the relevant documents are in terms of graded relevance

Each of these metrics is a lens that lets us look at a certain aspect of our retriever. However, they are not to be taken in isolation: more can often be gleaned from the gap between metrics than from any single metric alone. In the next article, we'll look into some common diagnostic patterns to see these metrics in action.

