
I'm currently evaluating a recommender system based on implicit feedback. I've been a bit confused with regard to the evaluation metrics for ranking tasks. Specifically, I am looking to evaluate by both precision and recall.

> Precision@k has the advantage of not requiring any estimate of the size of the set of relevant documents, but the disadvantages that it is the least stable of the commonly used evaluation measures and that it does not average well, since the total number of relevant documents for a query has a strong influence on precision at k.

I have noticed myself that it tends to be quite volatile, and as such I would like to average the results from multiple evaluation logs.

I was wondering: say I run an evaluation function which returns a NumPy array containing a precision@k score for each user, so that I now have an array of all the precision@3 scores across my dataset.

If I take the mean of this array, and then average that across, say, 20 different evaluation runs: is this equivalent to Mean Average Precision@K (MAP@K), or am I understanding this a little too literally?
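
For concreteness, here is a minimal sketch of what I mean (the values are made up purely for illustration):

```python
import numpy as np

# One (made-up) precision@3 score per user, from a single evaluation run.
p_at_3 = np.array([0.33, 0.67, 1.0, 0.0, 0.33])
run_score = p_at_3.mean()  # mean precision@3 over users for this run

# Averaging that per-run mean over, say, 20 evaluation runs.
run_scores = [0.47, 0.51, 0.44]  # illustrative values, one per run
overall_score = np.mean(run_scores)
print(run_score, overall_score)
```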

I am writing a dissertation with an evaluation section so the accuracy of the definitions is quite important to me.

apgsov
  • See if this helps - https://www.kaggle.com/nandeshwar/mean-average-precision-map-k-metric-explained-code – Nandesh Feb 18 '22 at 11:48

1 Answer


There are two averages involved, which makes the concepts somewhat obscure, but they are pretty straightforward (at least in the recsys context). Let me clarify them:

P@K

How many relevant items are present in the top-K recommendations of your system, as a fraction of K.


For example, to calculate P@3: take the top 3 recommendations for a given user and check how many of them are good ones. That number divided by 3 gives you the P@3.
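
As a minimal sketch of that calculation (the `precision_at_k` helper and the example items are hypothetical, not part of any particular library):

```python
def precision_at_k(recommended, relevant, k):
    """P@K: fraction of the top-k recommended items that are relevant."""
    top_k = recommended[:k]
    return len(set(top_k) & set(relevant)) / k

# 2 of the top 3 recommendations are relevant -> P@3 = 2/3
print(precision_at_k(["a", "b", "c", "d"], {"a", "c"}, k=3))
```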

AP@K

The mean of P@i for i=1, ..., K.


For example, to calculate AP@3: sum P@1, P@2 and P@3 and divide that value by 3.

AP@K is typically calculated for one user.
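
As a sketch of the definition given in this answer, reusing the hypothetical `precision_at_k` from the previous snippet (note that, as the comments below point out, the standard AP@K averages precision only over the positions at which a relevant item appears):

```python
def average_precision_at_k(recommended, relevant, k):
    """AP@K as defined above: the plain mean of P@1, ..., P@K."""
    return sum(precision_at_k(recommended, relevant, i)
               for i in range(1, k + 1)) / k
```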

MAP@K

The mean of the AP@K for all the users.


For example, to calculate MAP@3: sum AP@3 for all the users and divide that value by the number of users.
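
And the outer average over users, again only a sketch building on the hypothetical helpers above:

```python
def mean_average_precision_at_k(all_recommended, all_relevant, k):
    """MAP@K: the mean of AP@K over all users."""
    ap_scores = [average_precision_at_k(rec, rel, k)
                 for rec, rel in zip(all_recommended, all_relevant)]
    return sum(ap_scores) / len(ap_scores)

print(mean_average_precision_at_k(
    [["a", "b", "c"], ["x", "y", "z"]],  # per-user top recommendations
    [{"a", "c"}, {"y"}],                 # per-user relevant items
    k=3,
))
```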

If you are a programmer, you can check this code, which is the implementation of the functions apk and mapk of ml_metrics, a library maintained by the CTO of Kaggle.

Hope it helped!

dataista
  • It's worth noting that typically, when computing AP@K, one only averages over the values of k at which a relevant recommendation is made. That is what is being done in the linked code; it is also made clear [here](https://web.stanford.edu/class/cs276/handouts/EvaluationNew-handout-6-per.pdf). – A Person Jul 01 '20 at 17:27
  • "For example, to calculate P@3: take the top 3 recommendations for a given user and check how many of them are good ones." - Ambiguous answer. How do you determine what "good ones" are? – alex Aug 13 '20 at 19:30
  • @alex As far as I know, we find the best value of K by carrying out hyperparameter tuning methods, for example ALS, SGD, SVD, etc. Please do correct me if I'm wrong. – Purushothaman Srikanth Dec 19 '20 at 13:07
  • @alex 'good ones' are those that are **relevant**. There are a few ways to decide whether an item is relevant or not. For example, if the rating scale is 1-5, the relevance threshold could be set at 3, so any items scored <3 are deemed 'irrelevant', as they are likely to be items the user 'dislikes'. However, such a constant threshold is biased, as some users might be consistently prone to giving higher or lower ratings. Therefore, an alternative is to take the mean rating of each user and use that as the threshold: items rated >= this mean are 'relevant'. – TYL Feb 13 '21 at 08:31
  • The claim that AP@K = mean(P@1, P@2, ..., P@K) is false. [Here's a counterexample](https://www.practiceprobs.com/problemsets/evaluation-metrics-and-loss-functions/precision-and-recall/misconception/). – Ben Feb 18 '22 at 16:55
  • The definition of `AP@k` is not correct; please check [this](https://sdsawtelle.github.io/blog/output/mean-average-precision-MAP-for-recommender-systems.html#:~:text=Examples%20and%20Intuition%20for%20AP%C2%B6) explanation. – 3nomis Mar 25 '22 at 10:57