
I have a question about how to evaluate whether an information retrieval result is good or not, e.g., by calculating the relevant document rank, recall, precision, AP, MAP, and so on.

Currently, the system is able to retrieve documents from the database once the user enters a query. The problem is that I do not know how to do the evaluation.

I have a public data set, the "Cranfield collection" (dataset link). It contains:

1. documents 2. queries 3. relevance assessments

             DOCS   QRYS   SIZE*
Cranfield   1,400    225    1.6

May I know how to do the evaluation using the "Cranfield collection" to calculate the relevant document rank, recall, precision, AP, MAP, and so on?

I might need some ideas and direction; I am not asking how to code the program.

dd90p

2 Answers


Document Ranking

Okapi BM25 (BM stands for Best Matching) is a ranking function used by search engines to rank matching documents according to their relevance to a given search query. It is based on the probabilistic retrieval framework. BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of the inter-relationship between the query terms within a document (e.g., their relative proximity). See the Wikipedia page for more details.
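For intuition, here is a minimal, self-contained Python sketch of the BM25 formula; the toy corpus, the whitespace tokenization, and the parameter values k1 = 1.5 and b = 0.75 are illustrative assumptions, not something prescribed by the question or the Cranfield data.

```python
import math
from collections import Counter

# Toy corpus: each document is a list of tokens (assumed whitespace tokenization).
corpus = [
    "experimental investigation of the aerodynamics of a wing".split(),
    "simple shear flow past a flat plate".split(),
    "the boundary layer in simple shear flow".split(),
]
N = len(corpus)
avgdl = sum(len(doc) for doc in corpus) / N
df = Counter(term for doc in corpus for term in set(doc))  # document frequency

def idf(term):
    # Probabilistic idf used by BM25 (the +1 keeps it non-negative).
    return math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))

def bm25(query, doc, k1=1.5, b=0.75):
    tf = Counter(doc)
    score = 0.0
    for term in query:
        if term not in tf:
            continue
        num = tf[term] * (k1 + 1)
        den = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf(term) * num / den
    return score

query = "shear flow past a flat plate".split()
# Rank document indices by descending BM25 score.
ranking = sorted(range(N), key=lambda i: bm25(query, corpus[i]), reverse=True)
print(ranking)
```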

Precision and Recall

Precision measures "of all the documents we retrieved as relevant, how many are actually relevant?".

Precision = No. of relevant documents retrieved / No. of total documents retrieved

Recall measures "of all the actually relevant documents, how many did we retrieve as relevant?".

Recall = No. of relevant documents retrieved / No. of total relevant documents

Suppose a query "q" is submitted to an information retrieval system (e.g., a search engine), and 100 documents in the collection are relevant to "q". The system retrieves 68 documents out of a total collection of 600 documents, and 40 of the 68 retrieved documents are relevant. In this case:

Precision = 40 / 68 = 58.8% and Recall = 40 / 100 = 40%

F-Score / F-measure is the weighted harmonic mean of precision and recall. The traditional F-measure or balanced F-score is:

F-Score = 2 * Precision * Recall / (Precision + Recall)
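Putting the numbers from the worked example into code (a tiny Python sketch; the variable names are mine):

```python
# The worked example above: 600 documents in the collection, 100 of them
# relevant to query "q", 68 retrieved, 40 of the retrieved ones relevant.
relevant_retrieved = 40
retrieved = 68
total_relevant = 100

precision = relevant_retrieved / retrieved       # ≈ 0.588
recall = relevant_retrieved / total_relevant     # 0.40
f_score = 2 * precision * recall / (precision + recall)

print(f"Precision={precision:.3f} Recall={recall:.3f} F-Score={f_score:.3f}")
```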

Average Precision

You can think of it this way: you type something in Google and it shows you 10 results. It’s probably best if all of them are relevant. If only some are relevant, say five of them, then it’s much better if the relevant ones are shown first. It would be bad if the first five were irrelevant and the good ones only started from the sixth, wouldn’t it? The AP score reflects this.

Giving an example below:

[Figure: two example rankings of ten results, with precision computed at each relevant document]

AvgPrec of the two rankings:

Ranking#1: (1.0 + 0.67 + 0.75 + 0.8 + 0.83 + 0.6) / 6 = 0.78

Ranking#2: (0.5 + 0.4 + 0.5 + 0.57 + 0.56 + 0.6) / 6 = 0.52
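Since the figure is not reproduced here, the following Python sketch recomputes the two AvgPrec values; the 0/1 relevance vectors are reconstructed from the precision values listed above and are otherwise assumptions.

```python
def average_precision(rels):
    """rels: 0/1 relevance judgments of the ranked results, best rank first.
    Averages precision@k over the ranks k where a relevant document appears."""
    hits, precisions = 0, []
    for k, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Relevance vectors reconstructed from the precision values quoted above.
ranking1 = [1, 0, 1, 1, 1, 1, 0, 0, 0, 1]
ranking2 = [0, 1, 0, 0, 1, 1, 1, 0, 1, 1]
print(average_precision(ranking1))  # ≈ 0.78
print(average_precision(ranking2))  # ≈ 0.52
```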

Mean Average Precision (MAP)

MAP is the mean of average precision across multiple queries/rankings. Here is an example for illustration.

[Figure: ranked results for two queries, with precision computed at each relevant document]

Mean average precision for the two queries:

For query 1, AvgPrec: (1.0+0.67+0.5+0.44+0.5) / 5 = 0.62

For query 2, AvgPrec: (0.5+0.4+0.43) / 3 = 0.44

So, MAP = (0.62 + 0.44) / 2 = 0.53
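The same idea in code, again with relevance vectors reconstructed from the precision values quoted above (an illustrative sketch, not the original figure):

```python
def average_precision(rels):
    # Same helper as in the AP snippet above: average precision@k over the
    # ranks k at which a relevant document appears.
    hits, precisions = 0, []
    for k, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Relevance vectors reconstructed from the precision values quoted above.
query1 = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]   # AP ≈ 0.62
query2 = [0, 1, 0, 0, 1, 0, 1]            # AP ≈ 0.44

aps = [average_precision(q) for q in (query1, query2)]
print(sum(aps) / len(aps))                 # MAP ≈ 0.53
```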

Sometimes people use precision@k and recall@k as performance measures of a retrieval system. You need to build a retrieval system for such tests. If you want to write your program in Java, you should consider Apache Lucene to build your index.
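Precision@k and recall@k only look at the top k results; a small sketch with a made-up relevance list, assuming 4 relevant documents exist in the whole collection:

```python
def precision_at_k(rels, k):
    # Fraction of the top-k results that are relevant (rels: 0/1 list, ranked order).
    return sum(rels[:k]) / k

def recall_at_k(rels, k, total_relevant):
    # Fraction of all relevant documents that appear in the top k.
    return sum(rels[:k]) / total_relevant

# Hypothetical ranked result list; assume 4 relevant documents exist in total.
rels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
print(precision_at_k(rels, 5))      # 2/5 = 0.4
print(recall_at_k(rels, 5, 4))      # 2/4 = 0.5
```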

Wasi Ahmad
  • can I ask you one question about how to rank by the BM25 score? Using the TF-IDF method, we calculate the TF-IDF of the document and the query, and compare the cosine distance to rank the documents. But how do we do the ranking for BM25? For example, I got the BM25 scores for one document (the scores are: [0, -0.00993319335279988, 0.1712756703100223, -0.10833186147108911, -0.08897894166003212, 0.13457374095787467, 1.642922484773619, 0.15791141726235663, 1.0831388761516576]). How do I use the BM25 scores to do the ranking? – dd90p Nov 28 '16 at 03:44
  • ranking is done in the usual way: documents with a higher score rank higher, and vice versa (a short sketch of this follows these comments). TF-IDF is useful for similarity, but BM25 is useful for scoring documents based on the relevancy between the query and the documents. See the Wikipedia page of BM25 to learn more about the function; BM25 considers a lot of things while computing similarity. – Wasi Ahmad Nov 28 '16 at 03:58
  • ok, thanks a lot, I get the idea. After the ranking process, how do we identify which documents are relevant and which are irrelevant to the query? Do we need to assume that the top 3 in the ranked list are relevant and the others are irrelevant? In order to calculate recall and precision, we need to know the number of relevant and irrelevant documents. So how do we identify them? – dd90p Nov 28 '16 at 04:09
  • the best way to thank is by accepting the answer :) btw, you asked a very good question. You actually need such a dataset for that. I previously used the `AOL search query log` dataset for my research experiments. Since you are a novice, I encourage you to look into this assignment problem (http://www.cs.virginia.edu/~hw5x/Course/IR2015/_site/mps/2015/11/12/mp3/). I solved it when I took this course, and there is a small dataset for experiments. It will help you understand the relevant concepts. – Wasi Ahmad Nov 28 '16 at 05:45
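As a concrete illustration of the ranking step discussed in these comments, and assuming each entry in the list posted above is the BM25 score of one candidate document, ranking is just sorting the document indices by descending score:

```python
# BM25 scores from the comment above, assumed to be one score per candidate
# document for a single query.
scores = [0, -0.00993319335279988, 0.1712756703100223, -0.10833186147108911,
          -0.08897894166003212, 0.13457374095787467, 1.642922484773619,
          0.15791141726235663, 1.0831388761516576]

# Ranking = document indices sorted by descending score.
ranking = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
print(ranking)  # [6, 8, 2, 7, 5, 0, 1, 4, 3]
```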

Calculating precision and recall is simple: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of all relevant documents that were retrieved.

For example, if a query has 20 relevant documents and you retrieved 25 documents, of which only 14 are relevant to the query, then: Precision = 14/25 and Recall = 14/20.

But precision and recall should be combined; the combination is called the F-measure and is the harmonic mean of precision and recall: F-Score = 2 * Precision * Recall / (Precision + Recall).

AP is built from the proportion of relevant documents among the retrieved documents at each cutoff. Assume you retrieved 25 documents and, among the first 10, 8 relevant documents were retrieved; the precision at that cutoff is precision@10 = 8/10.

If you add up the precision values at the ranks where relevant documents appear and divide by N, the total number of relevant documents for the query in your data set, you have calculated AP for that query. MAP is then the mean of AP over all queries.

Alikbar
  • In my case, I do not know how many documents are relevant to a query. For the documents the program returns, the dataset I used is not labeled with which document is relevant to which query. So how do I measure the relevance between the query and a document? – dd90p Nov 27 '16 at 12:46
  • Of course they are labeled with query-document relevancy tags. Just look at your dataset again and read the readme file. This is an important part of it: the qrels are in three columns: the first is the query number, the second is the relevant document number, and the third is the relevancy code. The codes are defined in the readme file. – Alikbar Nov 27 '16 at 12:55
  • As you mentioned, "cranqrel" has the qrels: query number, document number, relevancy. However, not every document is labeled in "cranqrel". For example, say there are 1000 documents in total and 100 qrels in the cran dataset; for qrel id=74, only documents 576, 656, 575, 317, 574, 578, 541 are labeled with a relevancy. In the case that my search system retrieves documents 222, 333, 444, "cranqrel" does not have a relevancy for them. How do I do the evaluation? – dd90p Nov 28 '16 at 02:04
  • That is a problem of your algorithm. In the case you describe, precision, recall, etc. are zero. You didn't retrieve any relevant documents, so all your evaluation metrics will be zero. – Alikbar Nov 28 '16 at 05:07
  • What I mean is that "cranqrel" does not label all the documents. So if a retrieved document is not in "cranqrel", how do we identify its relevancy? We cannot say that the document is irrelevant just because "cranqrel" does not have a relevancy entry for it. – dd90p Nov 28 '16 at 05:23
  • As it says in the readme file, there are five types of relations between a query and a document. Code 5 shows that the document and query are not relevant, and in "cranqrel" these relations are not included. So you should assume that any document and query whose relation is not included in "cranqrel" are irrelevant (a sketch of this is given after this thread). – Alikbar Nov 28 '16 at 05:46
  • can I ask one more question about the relevancy? In "cranqrel", some scores are set to "-1". Should a document with "-1" be counted as relevant or irrelevant? The readme only explains the scores from 1-5; it never mentions -1. – dd90p Nov 28 '16 at 07:34
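Tying the thread together, here is a rough Python sketch of evaluating against cranqrel under the assumption discussed above, i.e., any (query, document) pair not listed in cranqrel is treated as irrelevant. The file name, the column layout, how you use the relevancy codes (including the undocumented -1 entries), and the helper names are assumptions you will need to adapt; AP here divides by the total number of relevant documents for the query, so relevant documents that are never retrieved lower the score.

```python
from collections import defaultdict

def load_qrels(path="cranqrel"):
    # Each line: query id, document id, relevancy code (per the readme quoted
    # above). Decide yourself how to treat the codes, including the -1 entries.
    qrels = defaultdict(set)
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 3:
                continue
            qid, docid, code = parts
            qrels[int(qid)].add(int(docid))   # optionally filter on `code` here
    return qrels

def evaluate(qrels, results):
    """results: dict mapping query id -> ranked list of retrieved doc ids.
    Any (query, doc) pair absent from qrels counts as irrelevant."""
    aps = []
    for qid, ranked in results.items():
        relevant = qrels.get(qid, set())
        hits, precisions = 0, []
        for k, docid in enumerate(ranked, start=1):
            if docid in relevant:
                hits += 1
                precisions.append(hits / k)
        precision = hits / len(ranked) if ranked else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        ap = sum(precisions) / len(relevant) if relevant else 0.0
        aps.append(ap)
        print(f"query {qid}: P={precision:.3f} R={recall:.3f} AP={ap:.3f}")
    print(f"MAP = {sum(aps) / len(aps):.3f}")

# Example call (the ranked lists would come from your own retrieval system):
# evaluate(load_qrels("cranqrel"), {74: [576, 222, 656, 333]})
```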