Faster way to take xor between one row and a matrix of 1's and 0's?

Question

I have a row of data, say A = [0 1 1 1 0 0].

Matrix B contains many rows. For a dummy example let's say it's just B = [1 1 1 0 1 0; 1 0 0 1 0 1].

I want to find the number of columns in which A and a row of B differ, and use that vector of differences to find which row of B is most similar to A. So for the example above, A differs from B(1,:) in columns 1, 4, 5 = 3 total difference. A differs from B(2,:) in columns 1, 2, 3, 6 = 4 total differences, and so I would want to return index 1 to indicate that A is most similar to B(1,:).

In reality B has ~50,000 rows, and A and B both have about 800 columns. My current code to find the most similar row is below:

min(sum(xor(repmat(A,B_rows,1),B),2));

That works, but it's very slow. Any insights into which function is taking so long and ways to improve it?

Robert Seifert · Accepted Answer · 2014-01-14T18:29:44.773

There are 6 more or less similar approaches with bsxfun, hard to tell which one is the fastest.

%example data
A = [0 1 1 1 0 0];
B = perms(A);     %720x6 matrix

Counting similarities:

Z = sum(  bsxfun(@eq,  A,B) , 2);
Z = sum( ~bsxfun(@ne,  A,B) , 2);
Z = sum( ~bsxfun(@xor, A,B) , 2);

Counting differences:

Z = sum( ~bsxfun(@eq,  A,B) , 2);
Z = sum(  bsxfun(@ne,  A,B) , 2);
Z = sum(  bsxfun(@xor, A,B) , 2);

Z is a vector containing the number of equal/unequal elements of A for every row of B.

Benchmark for 100 trials each (same order as above):

score 3 · Answer 2 · answered Jan 14 '14 at 17:10

3

Use pdist2 with 'hamming' distance

[D I] = pdist2( A, B, 'hamming', 'Smallest', 1 );

answered Jan 14 '14 at 17:10

Shai

111,146
38
238
371

In case it's not faster, it's at least a lot easier to understand. – Jonas Jan 14 '14 at 17:40
Ahh ok that's a new function for me. Thanks for all the answers everyone, this was very helpful. – user1956609 Jan 14 '14 at 18:40
@Shai: can you explain your solution, what is better compared to the others? (at)user1956609 you have chosen the slowest, any reason for that? I'm just curios :) – Robert Seifert Jan 14 '14 at 18:48
Yeah the reason is I misread the decimal point and just sat through an hour of this running, ha. Thanks for the update, trying the others soon. – user1956609 Jan 14 '14 at 20:41

score 2 · Answer 3 · edited Jun 20 '20 at 09:12

Benchmark for all solutions

(Thanks to Amro for the Benchmark-code)

function [t] = bench()
    A = [0 1 1 1 0 0];
    B = perms(A);

    % functions to compare
    fcns = {
        @() compare1(A,B);
        @() compare2(A,B);
        @() compare3(A,B);
        @() compare4(A,B);
    };

    % timeit
    t = zeros(4,1);
    for ii = 1:100;
        t = t + cellfun(@timeit, fcns);
    end
end

function Z = compare1(A,B)  %thewaywewalk
    Z = sum(  bsxfun(@eq,  A,B) , 2);
end
function Z = compare2(A,B)  %Matt J
    Z = sum(bsxfun(@xor, A, B),2);
end
function Z = compare3(A,B)  %Luis Mendo
    A = logical(A);
    Z = sum(B(:,~A),2) + sum(~B(:,A),2);
end
function Z = compare4(A,B)  %Shai
     Z = pdist2( A, B, 'hamming', 'Smallest', 1 );
end

result for 100 trials and a 720x6 matrix for B:

0.0068s   %thewaywewalk
0.0092s   %Matt J
0.0077s   %Luis Mendo
0.0399s   %Shai

result for 100 trials and a 40320x8 matrix for B:

0.0889s   %thewaywewalk
0.1512s   %Matt J
0.4571s   %Luis Mendo
0.4578s   %Shai

The bsxfun approach seems to be the fastest, and for that by using the anonymous function @eq.

you do not take into account that my solution gives as an output the nearest-neighbor: distance and index. It amounts to another call `[D I]=min(Z);` after each of all other candidates. I don't think this is going to change the timing results radically, but it may have an influence. Moreover, you might want to consider the effect of input type: setting `a — Shai, Jan 14 '14 at 21:20

score 1 · Answer 4 · answered Jan 14 '14 at 17:01

1

Sounds like the bsxfun command is what you need

min(sum(bsxfun(@xor, A, B),2))

answered Jan 14 '14 at 17:01

Matt J

1,127
7
15

Luis Mendo · Answer 5 · 2014-01-14T17:58:46.240

1

Another possibility, which replaces xor by indexing (not sure if it's faster than previous answers):

A = logical(A);
min(sum(B(:,~A),2) + sum(~B(:,A),2))

Or, to avoid ~ on B:

A = logical(A);
min(sum(B(:,~A),2) - sum(B(:,A),2)) + nnz(A)

edited Jan 14 '14 at 17:58

answered Jan 14 '14 at 17:40

Luis Mendo

110,752
13
76
147

Faster way to take xor between one row and a matrix of 1's and 0's?

5 Answers5

Benchmark for all solutions

Linked