
Assuming that I have a dataset of the following size:

train = 500,000 x 960 %number of training samples (vectors), each of length 960

B_base = 1,000,000 x 960 %number of base samples (vectors), each of length 960

Query = 1,000 x 960 %number of query samples (vectors), each of length 960

truth_nn = 1,000 x 100

truth_nn contains the ground truth neighbors in the form of the pre-computed k nearest neighbors and their squared Euclidean distances, so the columns of truth_nn represent the k = 100 nearest neighbors. I am finding it difficult to apply nearest neighbor search in the code snippet. Can somebody please show how to use the ground truth neighbors truth_nn to find the mean average precision and recall?

It would be of immense help if somebody could show a small example by creating a data matrix, a query matrix, and the ground truth neighbors in the form of pre-computed k nearest neighbors and their squared Euclidean distances. I tried creating a sample database.

Assume, the base data is

B_base = [1 1; 2 2; 3 2; 4 4; 5 6];

Query data is

Query = [1 1; 2 1; 6 2];

[neighbors, distances] = knnsearch(B_base, Query, 'K', 2);

would find the 2 nearest neighbors of each query.
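For a toy case like this, the ground-truth table itself can be produced by an exhaustive search over the base set: compute every query-to-base squared distance, sort, and keep the k smallest. Here is a sketch in Python/NumPy, since the idea is language-independent (the names `truth_ids` and `truth_sqdist` are mine; note the indices are 0-based, matching the GIST convention, whereas MATLAB's `knnsearch` returns 1-based indices):

```python
import numpy as np

B_base = np.array([[1, 1], [2, 2], [3, 2], [4, 4], [5, 6]], dtype=float)
Query  = np.array([[1, 1], [2, 1], [6, 2]], dtype=float)
k = 2

# squared Euclidean distances: one row per query, one column per base vector
sq = ((Query[:, None, :] - B_base[None, :, :]) ** 2).sum(axis=2)

order = np.argsort(sq, axis=1)             # ascending by distance
truth_ids    = order[:, :k]                # ids of the k nearest base vectors
truth_sqdist = np.take_along_axis(sq, order, axis=1)[:, :k]
```

The pair (`truth_ids`, `truth_sqdist`) has exactly the structure described for truth_nn: per query, the nearest-neighbor identifiers ordered by increasing distance, plus the squared distances.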

Question 1: how do I create the truth data containing the ground truth neighbors and the pre-computed k nearest neighbor distances? I then want to use it to compute the mean average precision and recall. I tried implementing the k nearest neighbor search and the average precision as follows, but I cannot understand how to apply the ground truth table.

Question 2:

I am trying to apply k nearest neighbor search by first converting the real-valued features into binary.

I am unable to apply k-nearest neighbor search for different values of k = 10, 20, 50 and to check how much data has been correctly recalled using the GIST database. In the GIST truth_nn file, when I specify truth_nn(i,1:k) for a query vector i, the function AveragePrecision throws an error. If somebody can show, using any sample ground truth with the same structure as GIST's, how to properly specify k and calculate the average precision and recall, then I shall be able to apply the solution to the GIST database. My current approach is below, and it would be of immense help if the correct way were shown with an example I can relate to the GIST database. The core problem is: how can I find neighbors from the ground truth and compare them with the neighbors obtained after sorting the distances?

I am also interested in how I can use pdist2() instead of the present distance calculation, as it takes a long time.
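On the speed question: `pdist2(Query, B_base, 'squaredeuclidean')` computes the full query-by-base distance matrix in one call, replacing the repmat loop entirely. The same vectorized idea can be sketched in Python/NumPy using the expansion ||q-b||^2 = ||q||^2 - 2 q.b + ||b||^2 (the function name `all_sq_dists` is mine):

```python
import numpy as np

def all_sq_dists(Q, B):
    """Squared Euclidean distance from every query (row of Q)
    to every base vector (row of B), with no explicit loop."""
    # ||q - b||^2 = ||q||^2 - 2 q.b + ||b||^2
    return (Q**2).sum(1)[:, None] - 2 * Q @ B.T + (B**2).sum(1)[None, :]

Q = np.array([[1., 1.], [2., 1.], [6., 2.]])
B = np.array([[1., 1.], [2., 2.], [3., 2.], [4., 4.], [5., 6.]])
D = all_sq_dists(Q, B)              # shape (3, 5): one row per query
ids = np.argsort(D, axis=1)[:, :2]  # two nearest base vectors per query
```

One sort over each row of D then gives exactly what the per-query loop produces, so the loop body reduces to a single matrix expression.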

numQueryVectors = size(Query,1);
numDataVectors  = size(B_base,1);
k = 50;
%Calculate squared distances from each query to every base vector
for i=1:numQueryVectors
    dist = sum((repmat(Query(i,:),numDataVectors,1)-B_base).^2,2);
    [sortval, sortpos] = sort(dist,'ascend');
    neighborIds(i,:) = sortpos(1:k);
    neighborDistances(i,:) = sqrt(sortval(1:k));
end


%HOW DO I SPECIFY k = 50 in the ground truth, truth_nn?
for i=1:numQueryVectors
    AP(i) = AveragePrecision(neighborIds(i,:),truth_nn(i,:));
end
mAP = mean(AP);


function ap = AveragePrecision(rank_id, truth_id)
    truth_num = length(truth_id);
    truth_pos = zeros(truth_num,1);

    for j=1:truth_num   % position of each true neighbor in the ranked list
        pos = find(rank_id == truth_id(j), 1);
        if isempty(pos)
            truth_pos(j) = Inf;   % true neighbor was not retrieved at all
        else
            truth_pos(j) = pos;
        end
    end
    truth_pos = sort(truth_pos, 'ascend');

    % compute average precision as the area below the recall-precision curve
    ap = 0;
    delta_recall = 1/truth_num;
    for j=1:truth_num
        p = j/truth_pos(j);
        ap = ap + p*delta_recall;
    end
end

UPDATE: Based on the solution, I tried to calculate the mean average precision using the formula given here and a reference code. But I am not sure my approach is correct, because the theory says that I need to rank the returned queries based on the indices, which I do not fully understand. Mean average precision is required to judge the quality of the retrieval algorithm.

precision = positives/total_data;
recall = positives/(positives+negatives);
truth_pos = sort(positives, 'ascend');
truth_num = length(truth_pos);

ap = 0;
delta_recall = 1/truth_num;
for j=1:truth_num
    p = j/truth_pos(j);
    ap = ap + p*delta_recall;
end
ap

The value of ap = Inf, positives = 0, and negatives = 150. This suggests that knnsearch() is not working at all.
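For reference, the textbook definition of average precision over a ranked list is: walk down the returned ids in rank order, add precision-at-j each time a true neighbor appears at rank j, and divide by the number of true neighbors. A sketch of that definition in Python (the function name `average_precision` and the toy lists are mine):

```python
def average_precision(ranked_ids, truth_ids):
    """AP of one ranked result list against the set of true neighbors.

    precision@j is accumulated each time a relevant item appears at
    rank j; the sum is normalized by the number of relevant items.
    """
    truth = set(truth_ids)
    hits, ap = 0, 0.0
    for j, rid in enumerate(ranked_ids, start=1):  # j = rank, 1-based
        if rid in truth:
            hits += 1
            ap += hits / j                         # precision at this rank
    return ap / len(truth)

# relevant items are found at ranks 1, 3 and 4
ap = average_precision([5, 9, 4, 8, 2], [5, 4, 8])
```

For k = 50 against the GIST ground truth, the truth set for query i would be truth_nn(i,1:k); mAP is then the mean of the per-query AP values.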

SKM
  • How is this different from your other question(s) on this topic? – beaker Jun 19 '16 at 19:58
  • @beaker: In my other question, I asked how to create multiple hash tables for the Locality Sensitive Hashing algorithm. Then I asked how I can work with the GIST database. In particular, I am struggling with how to apply the ground truth table that consists of the actual labels and distances. Since that was a very specific question, I thought of asking a general one where I created a simple query and base data. Now, I do not know how I can create the ground truth table. My objective is to apply the nearest neighbor search and assess the quality using the average precision-recall metric. – SKM Jun 27 '16 at 07:49
  • In order to apply average precision and recall, I believe we need the ground truth table. The GIST database does have one, but I do not understand how to use it. Therefore, I am asking for help in showing how to apply the ground truth in nearest neighbor search and in calculating the average precision and recall, using any sample ground truth table with the same structure as the GIST database's. – SKM Jun 27 '16 at 07:51
  • @halfer: I have answered the comment, please let me know if it is clear or not. Thank you – SKM Jun 27 '16 at 07:53
  • I've no idea, not my area - but it is good to reply! – halfer Jun 27 '16 at 07:57
  • I don't think you have to sort the list; MATLAB already sorts the neighbors in ascending order for each row. I think the question is how do you evaluate correctness? Say your search says `neighbors(1,:) = [2,5,8,2]` but `truth_nn = [5,4,8,2]`: the first 2 matches are incorrect, but the other 2 are correct. Does this mean 1 error for the entire row, or 2 errors, one for each incorrect measurement? – andrew Jun 29 '16 at 17:20
  • also how do you compute precision and recall? For this data, what is a false positive? false negative? true positive? true negative? – andrew Jun 29 '16 at 17:53
  • I have updated the Question with a code explaining how I calculated the precision and recall. The ground truth dataset in the GIST database contains true neighbors in the form of the pre-computed k = 1 to 100 nearest neighbors and their square Euclidean distance. So, I need to compare the nearest neighbor index returned from the knnsearch with that in the truth_nn() – SKM Jun 29 '16 at 18:13

1 Answer


I think you are doing extra work. This process is very simple in MATLAB; you can also operate on entire arrays, which should be faster than for loops and is a bit easier to read.

Your truth_nn and neighbors should contain the same data if there are no errors, with one entry per row. MATLAB already sorts the knnsearch result in ascending order, so column 1 is the closest neighbor, column 2 the second closest, column 3 the third closest, and so on. There is no need to sort the data again.

Just compare truth_nn to neighbors to get your statistics. This is a simple example to show how the program should go; it will not work on your data without some modification.

%in your example this is provided, I created my own
truth_nn = [1,2;
            1,3;
            4,3];

B_base = [1 1; 2 2; 3 2; 4 4; 5 6];
Query = [1 1; 2 1; 6 2];

%performs the k-nearest-neighbor search
num_clusters = 2;
[neighbors distances] = knnsearch(B_base,Query,'k',num_clusters);

%--- output---
% neighbors = [1,2; 
%              1,2; notice this doesn't match truth_nn 1,3
%              4,3]
% distances = [     0    1.4142;
%              1.0000    1.0000;
%              2.8284    3.0000];

%compute statistics: nnz counts the number of nonzero elements; in the
%first case that is every piece of data that matches
%NOTE1: the indexing on truth_nn (:,1:num_clusters) says use all rows
%       but only the first num_clusters columns. This should
%       prevent the dimension mismatch error you were getting
positives = nnz(neighbors == truth_nn(:,1:num_clusters ));     %result = 5
negatives = nnz(neighbors ~= truth_nn(:,1:num_clusters ));     %result = 1
%NOTE2: I've switched this from truth_nn to neighbors; this helps
%       when you change num_clusters
total_data = numel(neighbors);               %result = 6

percent_incorrect = 100*(negatives / total_data);   % 16.6666
percent_correct   = 100*(positives / total_data);   % 83.3333
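To turn the positives count above into the recall-at-k figure the question asks about (k = 10, 20, 50), one can intersect the returned ids with the first k ground-truth columns, query by query. A Python/NumPy sketch of that idea (the function name `recall_at_k` and the toy arrays are mine, not from the GIST files):

```python
import numpy as np

def recall_at_k(neighbor_ids, truth_nn, k):
    """Fraction of the true k nearest neighbors that appear among the
    first k returned ids, averaged over queries (one query per row)."""
    hits = 0
    for found, true in zip(neighbor_ids[:, :k], truth_nn[:, :k]):
        hits += len(set(found) & set(true))   # order-insensitive match
    return hits / (k * len(neighbor_ids))

neighbors = np.array([[2, 5, 8, 3],
                      [5, 4, 8, 2]])
truth_nn  = np.array([[2, 5, 9, 3],
                      [5, 4, 8, 1]])
r2 = recall_at_k(neighbors, truth_nn, 2)  # both top-2 sets match exactly
r4 = recall_at_k(neighbors, truth_nn, 4)  # 3 of 4 true neighbors per query
```

Unlike the elementwise `==` comparison, the set intersection does not penalize true neighbors that are returned at a slightly different rank, which is usually what "correctly recalled" means in the ANN literature.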
andrew
  • Thank you for your reply; I am going through your solution approach. I have to ask whether this method is applicable to binary data, i.e., if the elements are 0's and 1's (the feature space is binary). – SKM Jun 29 '16 at 19:27
  • For the structure of truth_nn: I downloaded the database from http://corpus-texmex.irisa.fr/ and I am using the GIST descriptor database called ANN_GIST1M. Under the description it says that the groundtruth file contains, for each query, the identifiers (vector number, starting at 0) of its k nearest neighbors, ordered by increasing distance, with k = 100 for the dataset. The first element of each integer vector is the nearest neighbor identifier associated with the query. So, could you please show how I can apply your proposed solution to work with the truth_nn file? – SKM Jun 29 '16 at 19:27
  • The data set is too large for me to download. But if the data is as you say, neighbors only (no distances), the code I gave you will work perfectly fine. Just use your own datasets instead of the variables I declared in the first few lines; everything from the knnsearch call onward needs no modification. – andrew Jun 29 '16 at 20:10
  • I am implementing your approach; however, apart from the statistics in % that you gave, I want to get the mean average precision for every query. This would give an evaluation of the ranked retrieval, so I can get a curve like this: http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html I can then get the mean average precision by averaging the answer. Can you please include the code for this as well in your answer? Thank you – SKM Jun 29 '16 at 22:49
  • I implemented your code on the GIST database, and I am getting the error: Error using == Matrix dimensions must agree. Error in Evaluate (line 127) positives = nnz(neighbors == truth_nn); The problem goes away if k = 100; anything less than that throws the error, since size(neighbors) = 1000 by 50 while size(truth_nn) = 1000 by 100. With k = 100 there is no error, but the result is incorrect, which is 0%!! – SKM Jun 29 '16 at 23:26
  • With the way you defined precision, `ap = percent_correct` in my own code. But precision and recall are usually meant for binary classifiers. A `false positive` is when the system said `true` but it should have said `false`; that concept can't be applied here. The same goes for `false negative`, which is why I asked in my comment to the question how you quantify these values. I also fixed the code (see **NOTE1**) to work if you use `k~=100`. The reason for the error is that `neighbors` and `truth_nn` were not the same size, `truth_nn: 1000*100` and `neighbors: 1000*k`, so the comparison wouldn't work. – andrew Jun 30 '16 at 00:58
  • The comparison works; the error is not there anymore. However, the metric value is not correct, as percent_correct = 0. I double checked by applying the formula to the fisheriris database in MATLAB, and percent_correct is around 1.2. In several papers, for multiclass and binary problems alike, mean average precision (MAP) is used. Based on the explanation here https://sanchom.wordpress.com/tag/average-precision/ I need to rank the returned queries. I have added the code but do not know how to calculate it correctly. Could you please show how I can calculate the mean average precision? – SKM Jun 30 '16 at 02:30
  • If `percent_correct` is 0 and you know the system is performing well, that means the data is not in the same format for `truth_nn` and `neighbors`. Since I don't have the data, you'll have to investigate this yourself. Both links you pointed to show precision and recall for binary systems. You can make a multiclass system (like CBIR) binary by using a different goal condition, for example: look at only cluster 1 (even though there are 50); 1 identified as 1 = true positive; 1 identified as not(1) = false negative; not(1) identified as 1 = false positive. From there you can calculate the metrics. – andrew Jun 30 '16 at 16:51