Mean average precision at k (MAP@k, computed over the top-k elements of the answer) should, according to Wikipedia, ml_metrics at Kaggle, and this answer: Confusion about (Mean) Average Precision, be computed as the mean of the average precisions at k, where average precision at k is computed as:

$$\mathrm{AP@}k = \frac{\sum_{i=1}^{k} P(i)\,\mathrm{rel}(i)}{\min(k,\ \text{number of relevant documents})}$$

where $P(i)$ is the precision at cut-off $i$ in the list, and $\mathrm{rel}(i)$ is an indicator function equal to 1 if the item at rank $i$ is a relevant document and 0 otherwise. The divisor $\min(k,\ \text{number of relevant documents})$ is the maximum possible number of relevant entries in the answer.
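To make the definition concrete, here is a minimal Python sketch of it (my own illustration, not taken from any of the codebases below); `ranked` is assumed to be the retrieved list and `relevant` the set of relevant document ids:

    def ap_at_k(ranked, relevant, k):
        """AP@k with the divisor min(k, number of relevant documents)."""
        if not relevant:
            return 0.0
        hits, score = 0, 0.0
        for i, doc in enumerate(ranked[:k]):
            if doc in relevant:              # rel(i) == 1
                hits += 1
                score += hits / (i + 1.0)    # P(i) at this cut-off
        return score / min(k, len(relevant))

    # e.g. ap_at_k([1, 9, 2, 7, 8], {1, 2, 3}, 5) == (1/1 + 2/3) / 3 ≈ 0.556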
Is this understanding correct?
Is MAP@k always less than MAP computed for the full ranked list?
My concern is that this is not how MAP@k is computed in many works. Typically the divisor is not min(k, number of relevant documents), but the number of relevant documents in the top-k. This approach gives a higher value of MAP@k, as the toy example below shows.
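A toy example of the difference (my own numbers): suppose k = 5, the relevance indicators of the top 5 results are [1, 0, 1, 0, 0], and the database contains 100 relevant documents in total:

    import numpy as np

    imatch = np.array([1, 0, 1, 0, 0], dtype=float)  # relevance of the top k = 5
    k, total_relevant = 5, 100                       # assumed toy numbers

    Px = np.cumsum(imatch) / np.arange(1, k + 1)     # precision at each cut-off

    print(np.sum(Px * imatch) / min(k, total_relevant))  # 0.333... (standard divisor)
    print(np.sum(Px * imatch) / np.sum(imatch))          # 0.833... (divisor = hits in top-k)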
"HashNet: Deep Learning to Hash by Continuation" (ICCV 2017)
Code: https://github.com/thuml/HashNet/blob/master/pytorch/src/test.py#L42-L51
    for i in range(query_num):
        label = validation_labels[i, :]
        label[label == 0] = -1
        idx = ids[:, i]
        # relevance of each of the top-R retrievals (R and APx are defined earlier in test.py)
        imatch = np.sum(database_labels[idx[0:R], :] == label, axis=1) > 0
        relevant_num = np.sum(imatch)
        Lx = np.cumsum(imatch)
        Px = Lx.astype(float) / np.arange(1, R+1, 1)
        if relevant_num != 0:
            APx.append(np.sum(Px * imatch) / relevant_num)
Here `relevant_num` is not min(k, number of relevant documents), but the number of relevant documents in the result, which is the same as neither the total number of relevant documents nor k. Am I reading the code wrong?
"Deep Visual-Semantic Quantization for Efficient Image Retrieval" (CVPR 2017)
Code: https://github.com/caoyue10/cvpr17-dvsq/blob/master/util.py#L155-L178
    def get_mAPs_by_feature(self, database, query):
        ips = np.dot(query.output, database.output.T)
        #norms = np.sqrt(np.dot(np.reshape(np.sum(query.output ** 2, 1), [query.n_samples, 1]), np.reshape(np.sum(database.output ** 2, 1), [1, database.n_samples])))
        #self.all_rel = ips / norms
        self.all_rel = ips
        ids = np.argsort(-self.all_rel, 1)
        APx = []
        query_labels = query.label
        database_labels = database.label
        print "#calc mAPs# calculating mAPs"
        bar = ProgressBar(total=self.all_rel.shape[0])
        for i in xrange(self.all_rel.shape[0]):
            label = query_labels[i, :]
            label[label == 0] = -1
            idx = ids[i, :]
            # relevance of each of the top self.R retrievals only
            imatch = np.sum(database_labels[idx[0: self.R], :] == label, 1) > 0
            rel = np.sum(imatch)
            Lx = np.cumsum(imatch)
            Px = Lx.astype(float) / np.arange(1, self.R+1, 1)
            if rel != 0:
                APx.append(np.sum(Px * imatch) / rel)
            bar.move()
        print "mAPs: ", np.mean(np.array(APx))
        return np.mean(np.array(APx))
Here the divisor is `rel`, which is computed as `np.sum(imatch)`, where `imatch` is a binary vector indicating whether each returned entry is relevant. The problem is that it considers only the first `R` entries: `imatch = np.sum(database_labels[idx[0: self.R], :] == label, 1) > 0`. So `np.sum(imatch)` gives the number of relevant entries in the returned list of size `R`, not `min(R, total number of relevant entries)`. And note that the values of `R` used in the paper are smaller than the number of entries in the database.
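If I read the standard definition correctly, matching it would require counting the relevant documents over the whole database rather than over the top `R`. A hedged sketch of such a change, reusing `database_labels`, `label`, `Px`, `imatch`, `rel`, and `self.R` from the snippet above (`all_relevant_num` is a hypothetical name of my own, not from the repository):

    # hypothetical fix: count relevant documents in the whole database, not only the top R
    all_relevant_num = np.sum(np.sum(database_labels == label, 1) > 0)
    if rel != 0:
        APx.append(np.sum(Px * imatch) / min(self.R, all_relevant_num))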
"Deep Learning of Binary Hash Codes for Fast Image Retrieval" (CVPR 2015)
Code: https://github.com/kevinlin311tw/caffe-cvprw15/blob/master/analysis/precision.m#L30-L55
    buffer_yes = zeros(K,1);
    buffer_total = zeros(K,1);
    total_relevant = 0;
    for j = 1:K
        retrieval_label = trn_label(y2(j));
        if (query_label==retrieval_label)
            buffer_yes(j,1) = 1;
            total_relevant = total_relevant + 1;
        end
        buffer_total(j,1) = 1;
    end
    % compute precision; Ns holds the ranks 1..K (defined earlier in the script)
    P = cumsum(buffer_yes) ./ Ns';
    if (sum(buffer_yes) == 0)
        AP(i) = 0;
    else
        AP(i) = sum(P.*buffer_yes) / sum(buffer_yes);
    end
Here the divisor is `sum(buffer_yes)`, which is the number of relevant documents in the returned list of size k, not min(k, number of relevant documents).
"Supervised Learning of Semantics-Preserving Deep Hashing" (TPAMI 2017)
Code: https://github.com/kevinlin311tw/Caffe-DeepBinaryCode/blob/master/analysis/precision.m
The code is the same as in the previous paper.
"Learning Compact Binary Descriptors with Unsupervised Deep Neural Networks" (CVPR 2016)
Same code: https://github.com/kevinlin311tw/cvpr16-deepbit/blob/master/analysis/precision.m#L32-L55
Am I missing something? Is the code in the papers above correct? Why does it not coincide with https://github.com/benhamner/Metrics/blob/master/Python/ml_metrics/average_precision.py#L25-L39 ?
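For comparison, the gist of the linked ml_metrics function, as I read it (paraphrased from memory, so please check the file itself), is that it truncates the prediction list to the top k and then divides by `min(len(actual), k)`:

    def apk(actual, predicted, k=10):
        # paraphrase of ml_metrics.average_precision.apk, not the verbatim source
        if len(predicted) > k:
            predicted = predicted[:k]
        score, num_hits = 0.0, 0.0
        for i, p in enumerate(predicted):
            if p in actual and p not in predicted[:i]:
                num_hits += 1.0
                score += num_hits / (i + 1.0)
        if not actual:
            return 0.0
        return score / min(len(actual), k)   # the divisor the papers above do not use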
Update
I found this closed issue referring to the same problem: https://github.com/thuml/HashNet/issues/2
Is the following claim correct?
> AP is a ranking metric. If the top 2 retrievals in the ranked list are relevant (and only the top 2), AP is 100%. You're talking about Recall, which in this case is indeed 0.2%.
From my understanding, if we treat AP as the area under the precision-recall curve, the claim above is not correct.
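A quick numeric check under that reading (my own toy numbers, chosen so that retrieving 2 relevant documents gives the 0.2% recall mentioned in the claim, i.e. 1000 relevant documents in total):

    import numpy as np

    total_relevant = 1000                       # assumed: 2/1000 = 0.2% recall
    hits = np.array([1.0, 1.0] + [0.0] * 998)   # top 2 relevant, nothing else retrieved

    P = np.cumsum(hits) / np.arange(1, hits.size + 1)

    print(np.sum(P * hits) / total_relevant)  # 0.002 -> 0.2%: AP over the full list
    print(np.sum(P * hits) / np.sum(hits))    # 1.0   -> 100%: the papers' convention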
P.S. I was in doubt whether this should go to Cross Validated or to Stack Overflow. If you think it is better placed on Cross Validated, I don't mind. My reasoning was that it is not a theoretical question but an implementation one, with references to actual code.