I'm currently evaluating a recommender system based on implicit feedback, and I've been a bit confused about the evaluation metrics for ranking tasks. Specifically, I want to evaluate using both precision and recall.
Precision@k has the advantage of not requiring any estimate of the size of the set of relevant documents, but the disadvantages that it is the least stable of the commonly used evaluation measures and that it does not average well, since the total number of relevant documents for a query has a strong influence on precision@k.
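As I understand it, precision@k for a single user is computed along these lines (a minimal sketch with hypothetical inputs and names, not my actual code):

    def precision_at_k(recommended, relevant, k=3):
        """Fraction of the top-k recommended items that are relevant.

        `recommended` is a ranked list of item ids and `relevant` is the set
        of items the user actually interacted with (both hypothetical here).
        """
        top_k = recommended[:k]
        hits = sum(1 for item in top_k if item in relevant)
        return hits / k

    precision_at_k(["a", "b", "c", "d"], {"b", "d"})  # -> 1/3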
I have noticed this volatility myself, so I would like to average the results across multiple evaluation logs.
I was wondering: say I run an evaluation function that returns a NumPy array containing a precision@k score for each user.
So now I have an array of all the precision@3 scores across my dataset.
If I take the mean of this array, and then average across, say, 20 such scores from different evaluation logs: is this equivalent to Mean Average Precision@k (MAP@k), or am I understanding the name a little too literally?
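To make the question concrete, here is a sketch of what I am computing versus the per-user average precision that some definitions of MAP@k seem to use (the array values and helper below are hypothetical placeholders, not my actual evaluation code):

    import numpy as np

    # Hypothetical per-user precision@3 scores, as returned by my evaluation function.
    precision_at_3_scores = np.array([0.33, 0.67, 1.0, 0.0, 0.33])

    # What I am currently doing: a simple mean over users
    # (and then averaging that value again over ~20 evaluation logs).
    mean_precision_at_3 = precision_at_3_scores.mean()

    def average_precision_at_k(recommended, relevant, k=3):
        """Average precision@k for one user: precision is accumulated at each
        rank (within the top k) where a relevant item appears."""
        hits, total = 0, 0.0
        for rank, item in enumerate(recommended[:k], start=1):
            if item in relevant:
                hits += 1
                total += hits / rank
        return total / min(len(relevant), k) if relevant else 0.0

    # MAP@k, in the definitions I have seen, appears to be the mean of
    # average_precision_at_k over all users, which is what I am trying to
    # reconcile with mean_precision_at_3 above.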
I am writing a dissertation with an evaluation section, so the accuracy of these definitions is quite important to me.