I'm using Python for a learning-to-rank problem, and I'm evaluating my results with the following DCG and NDCG code (from http://nbviewer.ipython.org/github/ogrisel/notebooks/blob/master/Learning%20to%20Rank.ipynb):
import numpy as np

def dcg(relevances, rank=20):
    """Discounted cumulative gain at the given rank cutoff."""
    relevances = np.asarray(relevances)[:rank]
    n_relevances = len(relevances)
    if n_relevances == 0:
        return 0.
    discounts = np.log2(np.arange(n_relevances) + 2)
    return np.sum(relevances / discounts)
def ndcg(relevances, rank=20):
    best_dcg = dcg(sorted(relevances, reverse=True), rank)
    if best_dcg == 0:
        return 0.
    return dcg(relevances, rank) / best_dcg
Here are the NDCG values for the best- and worst-case orderings of a 3-item list with no tied relevances...
>>> ndcg(np.asarray([3,2,1]))
1.0
>>> ndcg(np.asarray([1,2,3]))
0.78999800424603583
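For reference, here is where the 0.79 comes from: both orderings divide the same relevances by the same discounts, just paired up differently.

best_dcg  = 3/np.log2(2) + 2/np.log2(3) + 1/np.log2(4)   # ~4.7619
worst_dcg = 1/np.log2(2) + 2/np.log2(3) + 3/np.log2(4)   # ~3.7619
print(worst_dcg / best_dcg)                              # ~0.78999..., matching ndcg([1,2,3]) above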
We can use this metric to compare the two rankings and see which is better. If I calculate the worst case for a 4-item list, however...
>>> ndcg(np.asarray([1,2,3,4]))
0.74890302967841715
The 4-item list no longer seems comparable to the 3-item list: the worst-case score depends on the list length.
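This is easy to check by scoring the worst ordering for a few lengths with the ndcg() above:

# Worst-case NDCG for lists of increasing length
# (relevances 1..n presented in ascending, i.e. worst, order).
for n in range(2, 8):
    print(n, ndcg(np.arange(1, n + 1)))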
I have also calculated two alternative NDCGs. NDCG2 compares the achieved DCG to both the best and the worst case...
def ndcg2(relevances, rank=20):
    best_dcg = dcg(sorted(relevances, reverse=True), rank)
    worst_dcg = dcg(sorted(relevances, reverse=False), rank)
    if best_dcg == 0:
        return 0.
    return (dcg(relevances, rank) - worst_dcg) / (best_dcg - worst_dcg)
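One property of this min-max style normalization is that the endpoints no longer depend on the list length: by construction the worst ordering scores 0 and the best scores 1, e.g.

>>> ndcg2(np.asarray([1,2,3]))
0.0
>>> ndcg2(np.asarray([1,2,3,4]))
0.0
>>> ndcg2(np.asarray([4,3,2,1]))
1.0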
NDCG3 randomizes my list of actual relevances 50 times, calculates the DCG of each shuffle, and compares my actual DCG to the average.
def ndcg3(relevances, rank=20):
    shuffled = np.copy(relevances)
    rands = []
    for i in range(50):
        np.random.shuffle(shuffled)
        rands.append(dcg(shuffled, rank))
    avg_rand_dcg = np.mean(np.asarray(rands))
    return dcg(relevances, rank) / avg_rand_dcg
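As an aside, the quantity those 50 shuffles estimate has a closed form: in a uniformly random ordering, every one of the top-rank positions is equally likely to hold each relevance, so the expected random DCG is just the mean relevance times the sum of the reciprocal discounts. A sketch (the name expected_random_dcg is mine) that could stand in for avg_rand_dcg and remove the run-to-run noise:

def expected_random_dcg(relevances, rank=20):
    # Exact expectation of dcg() over a uniformly random ordering:
    # each of the top-`rank` positions holds each relevance with equal
    # probability, so E[DCG] = mean(relevances) * sum(1 / discounts).
    relevances = np.asarray(relevances, dtype=float)
    k = min(rank, len(relevances))
    if k == 0:
        return 0.
    discounts = np.log2(np.arange(k) + 2)
    return relevances.mean() * np.sum(1. / discounts)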
Across my various lists, I get the following metrics...
- NDCG: average is .87 (sounds good)
- Spearman rank: around .25 (not amazing, but there is something there)
- NDCG2: .58 (on average, slightly closer to the best DCG than to the worst)
- NDCG3: 1.04 (slightly better than randomly sorted lists)
I honestly can't make heads or tails of these results. My NDCG values seem good, but are they really comparable across lists? Do the alternative metrics make more sense?
edit: In my first random comparison (NDCG3) I was not using np.copy(), so the shuffles were reordering the original list as well and my random score was almost always .99. That is now fixed and the results make more sense.
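For anyone who hits the same thing, np.random.shuffle reorders its argument in place, which is why the copy matters:

a = np.asarray([3, 2, 1])
b = a                    # an alias, not a copy
np.random.shuffle(b)     # shuffles in place...
print(a)                 # ...so a has been reordered as well
c = np.copy(a)
np.random.shuffle(c)     # shuffling the copy...
print(a)                 # ...leaves a untouched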