4

I have a row of data, say A = [0 1 1 1 0 0].

Matrix B contains many rows. For a dummy example let's say it's just B = [1 1 1 0 1 0; 1 0 0 1 0 1].

I want to find the number of columns in which A and a row of B differ, and use that vector of differences to find which row of B is most similar to A. So for the example above, A differs from B(1,:) in columns 1, 4, 5 = 3 total difference. A differs from B(2,:) in columns 1, 2, 3, 6 = 4 total differences, and so I would want to return index 1 to indicate that A is most similar to B(1,:).

In reality B has ~50,000 rows, and A and B both have about 800 columns. My current code to find the most similar row is below:

min(sum(xor(repmat(A,B_rows,1),B),2));

That works, but it's very slow. Any insights into which function is taking so long and ways to improve it?

Robert Seifert
  • 25,078
  • 11
  • 68
  • 113
user1956609
  • 2,132
  • 5
  • 27
  • 43

5 Answers5

3

There are 6 more or less similar approaches with bsxfun, hard to tell which one is the fastest.

%example data
A = [0 1 1 1 0 0];
B = perms(A);     %720x6 matrix

Counting similarities:

Z = sum(  bsxfun(@eq,  A,B) , 2);
Z = sum( ~bsxfun(@ne,  A,B) , 2);
Z = sum( ~bsxfun(@xor, A,B) , 2);

Counting differences:

Z = sum( ~bsxfun(@eq,  A,B) , 2);
Z = sum(  bsxfun(@ne,  A,B) , 2);
Z = sum(  bsxfun(@xor, A,B) , 2);

Z is a vector containing the number of equal/unequal elements of A for every row of B.

Benchmark for 100 trials each (same order as above):

t100 =

    0.0081
    0.0098
    0.0123
    0.0102
    0.0087
    0.0111
Robert Seifert
  • 25,078
  • 11
  • 68
  • 113
3

Use pdist2 with 'hamming' distance

[D I] = pdist2( A, B, 'hamming', 'Smallest', 1 );
Shai
  • 111,146
  • 38
  • 238
  • 371
  • In case it's not faster, it's at least a lot easier to understand. – Jonas Jan 14 '14 at 17:40
  • Ahh ok that's a new function for me. Thanks for all the answers everyone, this was very helpful. – user1956609 Jan 14 '14 at 18:40
  • @Shai: can you explain your solution, what is better compared to the others? (at)user1956609 you have chosen the slowest, any reason for that? I'm just curios :) – Robert Seifert Jan 14 '14 at 18:48
  • Yeah the reason is I misread the decimal point and just sat through an hour of this running, ha. Thanks for the update, trying the others soon. – user1956609 Jan 14 '14 at 20:41
2

Benchmark for all solutions

(Thanks to Amro for the Benchmark-code)

function [t] = bench()
    A = [0 1 1 1 0 0];
    B = perms(A);

    % functions to compare
    fcns = {
        @() compare1(A,B);
        @() compare2(A,B);
        @() compare3(A,B);
        @() compare4(A,B);
    };

    % timeit
    t = zeros(4,1);
    for ii = 1:100;
        t = t + cellfun(@timeit, fcns);
    end
end

function Z = compare1(A,B)  %thewaywewalk
    Z = sum(  bsxfun(@eq,  A,B) , 2);
end
function Z = compare2(A,B)  %Matt J
    Z = sum(bsxfun(@xor, A, B),2);
end
function Z = compare3(A,B)  %Luis Mendo
    A = logical(A);
    Z = sum(B(:,~A),2) + sum(~B(:,A),2);
end
function Z = compare4(A,B)  %Shai
     Z = pdist2( A, B, 'hamming', 'Smallest', 1 );
end

result for 100 trials and a 720x6 matrix for B:

0.0068s   %thewaywewalk
0.0092s   %Matt J
0.0077s   %Luis Mendo
0.0399s   %Shai

result for 100 trials and a 40320x8 matrix for B:

0.0889s   %thewaywewalk
0.1512s   %Matt J
0.4571s   %Luis Mendo
0.4578s   %Shai

The bsxfun approach seems to be the fastest, and for that by using the anonymous function @eq.

Community
  • 1
  • 1
Robert Seifert
  • 25,078
  • 11
  • 68
  • 113
  • 1
    you do not take into account that my solution gives as an output the nearest-neighbor: distance and index. It amounts to another call `[D I]=min(Z);` after each of all other candidates. I don't think this is going to change the timing results radically, but it may have an influence. Moreover, you might want to consider the effect of input type: setting `a – Shai Jan 14 '14 at 21:20
1

Sounds like the bsxfun command is what you need

min(sum(bsxfun(@xor, A, B),2))
Matt J
  • 1,127
  • 7
  • 15
1

Another possibility, which replaces xor by indexing (not sure if it's faster than previous answers):

A = logical(A);
min(sum(B(:,~A),2) + sum(~B(:,A),2))

Or, to avoid ~ on B:

A = logical(A);
min(sum(B(:,~A),2) - sum(B(:,A),2)) + nnz(A)
Luis Mendo
  • 110,752
  • 13
  • 76
  • 147