I'm using Python for a learning-to-rank problem, and I'm evaluating my results with the following DCG and NDCG code (from http://nbviewer.ipython.org/github/ogrisel/notebooks/blob/master/Learning%20to%20Rank.ipynb):
import numpy as np

def dcg(relevances, rank=20):
    """Discounted cumulative gain at the given rank cutoff."""
    relevances = np.asarray(relevances)[:rank]
    n_relevances = len(relevances)
    if n_relevances == 0:
        return 0.
    discounts = np.log2(np.arange(n_relevances) + 2)
    return np.sum(relevances / discounts)
def ndcg(relevances, rank=20):
    best_dcg = dcg(sorted(relevances, reverse=True), rank)
    if best_dcg == 0:
        return 0.
    return dcg(relevances, rank) / best_dcg
Here are the NDCG values for the best- and worst-case orderings of a 3-item list with no tied relevances...
>>> ndcg(np.asarray([3,2,1]))
1.0
>>> ndcg(np.asarray([1,2,3]))
0.78999800424603583
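For reference, here is where the 0.79 comes from: both orderings divide the same relevances by the same discounts, just paired up differently.

best_dcg  = 3/np.log2(2) + 2/np.log2(3) + 1/np.log2(4)   # ~4.7619
worst_dcg = 1/np.log2(2) + 2/np.log2(3) + 3/np.log2(4)   # ~3.7619
print(worst_dcg / best_dcg)                              # ~0.78999..., matching ndcg([1,2,3]) above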
We can use this metric to compare the two rankings and see which is better. If I calculate the worst case for a 4-item list, however...
>>> ndcg(np.asarray([1,2,3,4]))
0.74890302967841715
The 4-item list no longer seems comparable to the 3-item list: the worst-case score depends on the list length.
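This is easy to check by scoring the worst ordering for a few lengths with the ndcg() above:

# Worst-case NDCG for lists of increasing length
# (relevances 1..n presented in ascending, i.e. worst, order).
for n in range(2, 8):
    print(n, ndcg(np.arange(1, n + 1)))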
I have also calculated two alternative NDCGs. NDCG2 compares the achieved DCG to both the best and the worst case...
def ndcg2(relevances, rank=20):
    best_dcg = dcg(sorted(relevances, reverse=True), rank)
    worst_dcg = dcg(sorted(relevances, reverse=False), rank)
    if best_dcg == 0:
        return 0.
    return (dcg(relevances, rank) - worst_dcg) / (best_dcg - worst_dcg)
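One property of this min-max style normalization is that the endpoints no longer depend on the list length: by construction the worst ordering scores 0 and the best scores 1, e.g.

>>> ndcg2(np.asarray([1,2,3]))
0.0
>>> ndcg2(np.asarray([1,2,3,4]))
0.0
>>> ndcg2(np.asarray([4,3,2,1]))
1.0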
NDCG3 randomizes my list of actual relevances 50 times, calculates the DCG of each shuffle, and compares my actual DCG to the average.
def ndcg3(relevances, rank=20):
    shuffled = np.copy(relevances)
    rands = []
    for i in range(50):
        np.random.shuffle(shuffled)
        rands.append(dcg(shuffled, rank))
    avg_rand_dcg = np.mean(np.asarray(rands))
    return dcg(relevances, rank) / avg_rand_dcg
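As an aside, the quantity those 50 shuffles estimate has a closed form: in a uniformly random ordering, every one of the top-rank positions is equally likely to hold each relevance, so the expected random DCG is just the mean relevance times the sum of the reciprocal discounts. A sketch (the name expected_random_dcg is mine) that could stand in for avg_rand_dcg and remove the run-to-run noise:

def expected_random_dcg(relevances, rank=20):
    # Exact expectation of dcg() over a uniformly random ordering:
    # each of the top-`rank` positions holds each relevance with equal
    # probability, so E[DCG] = mean(relevances) * sum(1 / discounts).
    relevances = np.asarray(relevances, dtype=float)
    k = min(rank, len(relevances))
    if k == 0:
        return 0.
    discounts = np.log2(np.arange(k) + 2)
    return relevances.mean() * np.sum(1. / discounts)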
Across my various lists, I get the following metrics...
- NDCG: average is .87 (sounds good)
- Spearman rank: around .25 (not amazing, but there is something there)
- NDCG2: .58 (on average, slightly closer to the best DCG than to the worst)
- NDCG3: 1.04 (slightly better than randomly sorted lists)
I honestly can't make heads or tails of these results. My NDCG values seem good, but are they really comparable across lists? Do the alternative metrics make more sense?
edit: In my first random comparison (NDCG3) I was not using np.copy(), so the shuffles were reordering the original list as well and my random score was almost always .99. That is now fixed and the results make more sense.
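For anyone who hits the same thing, np.random.shuffle reorders its argument in place, which is why the copy matters:

a = np.asarray([3, 2, 1])
b = a                    # an alias, not a copy
np.random.shuffle(b)     # shuffles in place...
print(a)                 # ...so a has been reordered as well
c = np.copy(a)
np.random.shuffle(c)     # shuffling the copy...
print(a)                 # ...leaves a untouched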