1

I need to do function that works like this :

N1 = size(X,1);
N2 = size(Xtrain,1);
Dist = zeros(N1,N2);

    for i=1:N1
        for j=1:N2
            Dist(i,j)=D-sum(X(i,:)==Xtrain(j,:));
        end
    end

(X and Xtrain are sparse logical matrixes)

It works fine and passes the tests, but I believe it's not very optimal and well-written solution.

How can I improve that function using some built Matlab functions? I'm absolutely new to Matlab, so I don't know if there really is an opportunity to make it better somehow.

michalsol
  • 752
  • 8
  • 29
  • Yes, I know pdist2(X,Xtrain,'hamming').*D works fine, but I want something else. I'm learing Matlab and would like to know if I can somehow replace that for loops statements with something better. – michalsol Mar 25 '16 at 17:24
  • you could write it as one for-loop, each iteration vectorized over entire column. You could even compute without any loops, but that would involve repeating the data several times consuming much more memory.. But again, what's wrong with `pdist2`? – Amro Mar 25 '16 at 17:26
  • How can I do it as one for-loop iteration? I've tried it but couldn't achieve that. There's nothing wrong with pdist2, eventually I probably will do it with that function. I just try to understand matrix and vectors operations in Matlab and I would like to know how can I do it, eg. using vectorized operations (as you mentioned). – michalsol Mar 25 '16 at 17:37
  • see my answer below for several implementations – Amro Mar 25 '16 at 18:23

1 Answers1

4

You wanted to learn about vectorization, here some code to study comparing different implementations of this pair-wise distance.

First we build two binary matrices as input (where each row is an instance):

m = 5;
n = 4;
p = 3;
A = double(rand(m,p) > 0.5);
B = double(rand(n,p) > 0.5);

1. double-loop over each pair of instances

D0 = zeros(m,n);
for i=1:m
    for j=1:n
        D0(i,j) = sum(A(i,:) ~= B(j,:)) / p;
    end
end

2. PDIST2

D1 = pdist2(A, B, 'hamming');

3. single-loop over each instance against all other instances

D2 = zeros(m,n);
for i=1:n
    D2(:,i) = sum(bsxfun(@ne, A, B(i,:)), 2) ./ p;
end

4. vectorized with grid indexing, all against all

D3 = zeros(m,n);
[x,y] = ndgrid(1:m,1:n);
D3(:) = sum(A(x(:),:) ~= B(y(:),:), 2) ./ p;

5. vectorized in third dimension, all against all

D4 = sum(bsxfun(@ne, A, reshape(B.',[1 p n])), 2) ./ p;
D4 = permute(D4, [1 3 2]);

Finally we compare all methods are equal

assert(isequal(D0,D1,D2,D3,D4))
Amro
  • 123,847
  • 25
  • 243
  • 454
  • careful that the vectorized method will create large intermediate arrays, and could potentially throw out-of-memory error if the input matrices are large enough. It is in the order of `m*n*p` in terms of space. The single-loop has `m*n` space, while the double-loop takes only `m` space (really `min(m,n)` because we can loop over rows or columns, whatever is smaller) – Amro Mar 25 '16 at 18:33
  • Note: the same can be done with other distance functions. Here's the Euclidean distance: http://stackoverflow.com/q/7696734/97160 – Amro Mar 25 '16 at 18:40
  • Thank you! I will analyse all of that solutions, – michalsol Mar 26 '16 at 14:44