3

I have a question. I am trying to compute pairwise distances between vectors. Let me first explain the problem: I have two sets of vectors X and Y. X has three vectors x1, x2 and x3. Y has three vectors y1, y2 and y3. Note vectors in X and Y are of length m and n respectively. Let the dataset be represented as this image:

enter image description here

I am trying to compute a similarity matrix such as this:

enter image description here. Now the different colour coded parts are explained - All those cells marked with 0 need not be computed. I have intentionally put it as 100 (it can be any value). The grey cells have to be computed. The similarity score is computed as the L2norm of (xi-xj) + L2 norm of (yi-yj).

Which means the entries are

M((x_i,y_j), (x_k,y_l)) := norm(x_i-x_k,2) + norm(y_j-y_l,2)

I have written a basic code to do this:

clc;clear all;close all;
%% randomly generate data
m=3; n1=4; n2=6;
train_a_mean = rand(m,n1);
train_b_mean = rand(m,n2);
p = size(train_a_mean,1)*size(train_b_mean,1);
score_mean_ab = zeros(p,p);

%% This is to store the index variables 
%% This is required for futu
idx1 = score_mean_ab;
idx2 = idx1; idx3 = idx1; idx4 = idx1;

a=1; b=1;
for i=1:size(score_mean_ab,1)
    c = 1; d = 1;
    for j=1:size(score_mean_ab,2)
        if (a==c)
            score_mean_ab(i,j) = 100;
        else            
            %% computing distances between the different modalities and
            %% summing them up
            score_mean_ab(i,j) = norm(train_a_mean(a,:)-train_a_mean(c,:),2) ...
            + norm(train_b_mean(b,:)-train_b_mean(d,:),2);
        end
        %% saving the indices
        idx1(i,j)=a; idx2(i,j)=b; idx3(i,j)=c; idx4(i,j)=d;        
        %% updating the values of c and d
        if mod(d,size(train_a_mean,1))==0
            c = c + 1;
            d = 1;
        else
            d = d+1;
        end
    end
    %% updating the values of a and b
    if mod(b,size(train_a_mean,1))==0
        a = a + 1;
        b = 1;
    else        
        b = b+1;
    end
end

For a dry sample run of the matrix: I get these results -

score_mean_ab =

  100.0000  100.0000  100.0000    0.6700    1.6548    1.5725    0.8154    1.8002    1.7179
  100.0000  100.0000  100.0000    1.6548    0.6700    1.5000    1.8002    0.8154    1.6454
  100.0000  100.0000  100.0000    1.5725    1.5000    0.6700    1.7179    1.6454    0.8154
    0.6700    1.6548    1.5725  100.0000  100.0000  100.0000    1.3174    2.3022    2.2200
    1.6548    0.6700    1.5000  100.0000  100.0000  100.0000    2.3022    1.3174    2.1475
    1.5725    1.5000    0.6700  100.0000  100.0000  100.0000    2.2200    2.1475    1.3174
    0.8154    1.8002    1.7179    1.3174    2.3022    2.2200  100.0000  100.0000  100.0000
    1.8002    0.8154    1.6454    2.3022    1.3174    2.1475  100.0000  100.0000  100.0000
    1.7179    1.6454    0.8154    2.2200    2.1475    1.3174  100.0000  100.0000  100.0000

However my code is very slow. I took a very few sample runs and got these results:

m=3; n1=3; n2=3;
Elapsed time is 0.000363 seconds.

m=10; n1=3; n2=3;
Elapsed time is 0.042015 seconds.

m=10; n1=1800; n2=1800;
Elapsed time is 0.230046 seconds.

m=20; n1=1800; n2=1800;
Elapsed time is 4.309134 seconds.

m=30; n1=1800; n2=1800;
Elapsed time is 23.058106 seconds.

My Questions :

  1. Typically I will have values of m~100 and n1~2000 and n2~2000. My own code breaks down at this point. Is there any optimised way to do this ?
  2. Can the inbuilt matlab function pdist2 be used for this purpose?

NOTE: The vectors are actually in the form of row vectors and the value of n1 and n2 may not be equal.

roni
  • 1,443
  • 3
  • 28
  • 49
  • 1
    Please elaborate on the *similarity score*. What is the entry of the matrix `M((x_i,y_j), (x_k,y_l))`? – knedlsepp Feb 25 '15 at 13:36
  • The similarity score is computed as follows : norm(x_i-x_k,2) + norm(y_j-y_l,2). Is this clear now? I am computing the similarity between x_i and x_k & between y_j and y_l and adding them up. – roni Feb 25 '15 at 13:47
  • 1
    Please add this information to the question! Looks like this could be achieved by doing one `pdist2(...,'euclidean')` and one `pdist2(...,'cityblock')` using on those distances. Don't know about the details yet. Also you didn't specify if your data are column or row vectors. – knedlsepp Feb 25 '15 at 13:52
  • So, whatever maybe the sizes of input vectors, `score_mean_ab` would always be `9 x 9`, right? – Divakar Feb 25 '15 at 14:01
  • @Divakar: I think this is a bug of *roni*'s code. He generates the transposed data... – knedlsepp Feb 25 '15 at 14:16
  • @knedlsepp Yeah, I changed `n1` and `n2` and the output size remains the same, so that's why I had that doubt. Ah! I must have changed `m` too! – Divakar Feb 25 '15 at 14:19
  • @Divakar: I take it back. I read it again. You are right: It should be `9x9` – knedlsepp Feb 25 '15 at 14:25
  • sorry for the late reply. @Divakar the value of the score matrix should always be m x m. I hope this is clear. To answer knedlsepp, my data is actually row vectors. – roni Feb 25 '15 at 14:34
  • 2
    @roni You mean the output size would be `m^2 X m^2`, right? – Divakar Feb 25 '15 at 14:39
  • @Divakar yes. That's correct. I am sorry for that confusion. It actually occurred as my data vectors are rows and not "columns". Anyway can any of you guys do a speed-time comparison for different values of m? I don't think speed time will be that dependent on n. Is my assumption correct? – roni Feb 25 '15 at 14:41
  • @roni: What are your data's actual sizes? – knedlsepp Feb 25 '15 at 14:43
  • @roni Do you need to set those diagonal blocks to be all 100's or can we set those to all zeros or leave them as whatever they might be? – Divakar Feb 25 '15 at 14:44
  • @knedlsepp well I am using images of visual and infrared spectrum. Let's say I have 100 images of each (my typical value). Each image have a feature vector length of size 1792 (visual) and 1500 (infrared). So my matrices are of dimension 100x1792 and 100x1500. – roni Feb 25 '15 at 14:45
  • @Divakar Actually I do not want to compute those distances (as in my algorithm they will not be required). So to save computation time I designed my algorithm in such a way. The value can be anything (but I took a high value) - because after this operation I require to find the minimum index in each row of my score matrix. While computing the min indices I do not want to consider those cells marked with no 100. – roni Feb 25 '15 at 14:49
  • 1
    If you require only the indices, we don't need to compute the full matrix though, at is symmetric (with symmetric blocks)... Also: Divakar already provides code to delete those entries, which you can use for any of the three solutions. – knedlsepp Feb 25 '15 at 14:55
  • @knedlsepp can you elaborate on your comment "don't need to compute the full matrix"? I am trying to design a kind of cross-similarity matching score and then I will use the k-nearest neighbours. That's why I require those indices. But yes you are right - The matrix is symmetric and so maybe just computing the upper-triangular one or the lower-triangular one will be conveniently faster. Is my assumption correct? – roni Feb 25 '15 at 15:03
  • I think this could be simplified even more using the structure of your problem. Look the output of `spy(bsxfun(@eq,out,min(out,[],2)))` for example. (When diagonal blocks are set to `100` or `inf`). Computing one row per block will be enough to determine the indices of said block, maybe even less with a clever approach. – knedlsepp Feb 25 '15 at 15:16
  • 1
    @roni Were you able to test out the various approaches listed here? – Divakar Feb 26 '15 at 13:47
  • @Divakar yes I did I will upload the results today. I tested it for m = 1 to 100. Please give me some time I will do it. – roni Feb 27 '15 at 04:07

3 Answers3

3

Here's a way to do it. This computes all entries.

m = 3;             %// number of (row) vectors in X and in Y
n1 = 3;            %// length of vectors in X
n2 = 3;            %// length of vectors in Y
X = rand(m, n1);   %// random data: X
Y = rand(m, n2);   %// random data: Y

[ii, jj] = ndgrid(1:m); 
U = reshape(sqrt(sum((X(ii,:)-X(jj,:)).^2, 2)), m, m);
V = reshape(sqrt(sum((Y(ii,:)-Y(jj,:)).^2, 2)), m, m);
result = U(ceil(1/m:1/m:m), ceil(1/m:1/m:m)) + repmat(V, m, m);

Or you could use bsxfun instead of ndgrid:

U = sqrt(sum(bsxfun(@minus, permute(X, [1 3 2]), permute(X, [3 1 2])).^2, 3));
V = sqrt(sum(bsxfun(@minus, permute(Y, [1 3 2]), permute(Y, [3 1 2])).^2, 3));
result = U(ceil(1/m:1/m:m), ceil(1/m:1/m:m)) + repmat(V, m, m);
Luis Mendo
  • 110,752
  • 13
  • 76
  • 147
  • Ok. I see we are using the same approach, you avoiding `pdist2` and `kron`. I was unsure for a while if I understood the question correctly. – knedlsepp Feb 25 '15 at 14:33
  • Yep. I tried to use "simpler" (?) functions for speed – Luis Mendo Feb 25 '15 at 14:35
  • I used a slightly different approach on bsxfun to go 4th dim :) – Divakar Feb 25 '15 at 14:36
  • @Divakar As I was writing by `bsxfun` approach I thought about that. I discarded it as too difficult. And then I thought "I know who is working on _that_ right now!" :-D – Luis Mendo Feb 25 '15 at 14:39
  • @LuisMendo haha well, yeah I guess I did that "plus" thing with bsxfun, that is sort of different. – Divakar Feb 25 '15 at 14:40
  • Hi thanks for this code. But can those cell entries that I had marked with 0 be set a predefined value? I really need that. – roni Feb 25 '15 at 14:52
  • @roni My code can't avoid computing those entries (at least not easily). But you could use the last part of Divakar's answer to set those entries to a predefined value – Luis Mendo Feb 25 '15 at 14:59
  • @LuisMendo yes saw that just now. Sorry for that query. Thanks for the great answer. I wonder which of these methods will be faster ? – roni Feb 25 '15 at 15:00
  • 1
    @roni Can you test it yourself? You know your data sizes etc. Are you familiar with the `timeit` function? It's better than `tic`/`toc` for timing – Luis Mendo Feb 25 '15 at 15:05
2

You can achieve this using:

m = 3; % Number of vectors in X/Y (must have same number of vectors)
XD = squareform(pdist(X)); %// == pdist2(X,X) but faster
YD = squareform(pdist(Y)); %// == pdist2(Y,Y) but faster
M = kron(XD,ones(m,m)) + repmat(YD,m,m);

Pay attention that in order to make pdist work, X and Y must be given as row vectors. Also: Ignore the diagonal blocks.

knedlsepp
  • 6,065
  • 3
  • 20
  • 41
2

Assuming A as train_a_mean and B as train_b_mean for easy access within the codes, you can use two approaches here to get to your final destination, which is the array of row-wise minimum indices of your output score_mean_ab.

Approach #1

This approach is based on bsxfun for getting the norm and its summations and also for getting the linear indices to set the "diagonal block" elements as all Infs as per the question's requirements. Here's the implementation -

%// Parameter
M = m^2; 

%// Get pairwise norms
nm1 = sqrt(sum(bsxfun(@minus,A,permute(A,[3 2 1])).^2,2));
nm2 = sqrt(sum(bsxfun(@minus,B,permute(B,[3 2 1])).^2,2));

%// Get sum of norms and the final values
norm_sum = bsxfun(@plus,nm1,permute(nm2,[2 1 4 3]));

%// Get "diagonal block" elements and set them to all Infs
ind1 = bsxfun(@plus,[1:m:M]',[0:m-1]*(M+1));  %//'
ind2 = bsxfun(@plus,ind1(:),[0:m-1]*m^3);
norm_sum(ind2) = Inf;

[~,min_idx] = min(reshape(norm_sum,m,M,[]),[],2);
min_idx = reshape(reshape(min_idx,m,[])',[],1);

Approach #2

This approach ab(uses) matrix multiplication based distance matrix calculation for a possibly faster solution. The code is listed next -

%// Parameters
nA = size(A,2);
nB = size(B,2);
M = m^2;

%// Get the pairwise norms for both A and B
A_t = A';  %//'
norm_a = sqrt([-2*A A.^2 ones(m,nA)]*[A_t ; ones(nA,m) ; A_t.^2])
norm_a(1:m+1:end) = 0;
B_t = B';  %//'
norm_b = sqrt([-2*B B.^2 ones(m,nB)]*[B_t ; ones(nB,m) ; B_t.^2])
norm_b(1:m+1:end) = 0;

%// Norm sums
norm_sum = reshape(bsxfun(@plus,norm_a(:).',norm_b(:)),m,m,[]) %//'

%// Set the "diagonal blocks" as all Infs
norm_sum(:,:,1:m+1:M) = Inf

%// Re-arrange into the desired 2D output and get the minimum indices
out = reshape(permute(reshape(permute(norm_sum,[1 3 2]),M,m,[]),[1 3 2]),M,M);
[~,min_idx] = min(out,[],2);
Community
  • 1
  • 1
Divakar
  • 218,885
  • 19
  • 262
  • 358