Faster alternative to INTERSECT with 'rows' - MATLAB

Question

I have code written in MATLAB that uses 'intersect' to find the vectors (and their indices) that intersect in two large matrices. I found that 'intersect' is the slowest line (by a wide margin) in my code. Unfortunately, I haven't found a faster alternative so far.

As an example, running the code below takes approximately 5 seconds on my PC:

profile on
for i = 1 : 500
    a = rand(10000,5);
    b = rand(10000,5);
    [intersectVectors, ind_a, ind_b] = intersect(a,b,'rows');
end
profile viewer

I was wondering if there is a faster way. Note that the matrices (a) and (b) have 5 columns. The number of rows doesn't necessarily have to be the same for the two matrices.

Any help would be great. Thanks

Any other constraint, like maybe they are arrays of integers? Single digit numbers? — Divakar, Feb 13 '15 at 06:37
the numbers inside the matrices are integers, but stored as doubles — Jack, Feb 13 '15 at 07:55
do you know anything about the numbers? Are they e.g. sorted? — Jonas, Feb 13 '15 at 09:53
Hi Jonas. No, the numbers can take any combination. So far, I believe the work done by Divakar is pretty impressive and good enough for my work, however a faster way is always better if you want to investigate more. BTW, the numbers don't have to be stored as doubles. — Jack, Feb 15 '15 at 23:16

score 3 · Accepted Answer · edited May 23 '17 at 11:51

Discussion and solution codes

You can use an approach that leverages fast matrix multiplication in MATLAB to convert the 5 columns of each input array into one column, by treating each column as a significant "digit" of a single number. This assumes the entries are nonnegative integers, so that distinct rows map to distinct numbers. You end up with arrays that have only one column, and you can then use intersect or ismember without 'rows', which should speed up the code considerably.
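As a small illustration of the encoding, here is a minimal sketch on toy matrices (the data and the variable names `base` and `common` are illustrative, not from the question). It assumes every entry is a nonnegative integer smaller than the chosen base.

```matlab
% Toy inputs: nonnegative integers stored as doubles (illustrative data)
a = [1 2 3; 4 5 6; 1 2 3];
b = [4 5 6; 7 8 9];

base = 10;                                % any base larger than the largest entry (9 here)
mult = (base .^ (size(a,2)-1:-1:0)).';    % column vector [100; 10; 1]

acol1 = a * mult;                         % one key per row of a: [123; 456; 123]
bcol1 = b * mult;                         % one key per row of b: [456; 789]

% Equal rows give equal keys, so intersect without 'rows' finds the matches
[~, ind_a, ind_b] = intersect(acol1, bcol1);
common = a(ind_a, :);                     % the shared row [4 5 6]
```

The first column gets the largest multiplier, so sorting the keys orders the rows the same way a row-wise sort would.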

Here are the implementations, written as functions for easy use:

intersectrows_fast_v1.m:

function [intersectVectors, ind_a, ind_b] = intersectrows_fast_v1(a,b)

%// Calculate equivalent one-column versions of input arrays
mult = [10^ceil(log10( 1+max( [a(:);b(:)] ))).^(size(a,2)-1:-1:0)]'; %//'
acol1 = a*mult;
bcol1 = b*mult;

%// Use intersect without 'rows' option for a good speedup
[~, ind_a, ind_b] = intersect(acol1,bcol1);
intersectVectors = a(ind_a,:);

return;

intersectrows_fast_v2.m:

function [intersectVectors, ind_a, ind_b] = intersectrows_fast_v2(a,b)

%// Calculate equivalent one-column versions of input arrays
mult = [10^ceil(log10( 1+max( [a(:);b(:)] ))).^(size(a,2)-1:-1:0)]'; %//'
acol1 = a*mult;
bcol1 = b*mult;

%// Use ismember to get indices of the common elements
[match_a,idx_b] = ismember(acol1,bcol1);

%// Unlike intersect, ismember does not remove duplicate items automatically.
%// So, find the repeated items in acol1 and clear their flags so that only
%// the first occurrence of each value is kept
[a_sorted,a_sorted_ind] = sort(acol1);
a_rm_ind = a_sorted_ind([false;diff(a_sorted)==0]); %//indices to be removed
match_a(a_rm_ind) = false;

intersectVectors = a(match_a,:);
ind_a = find(match_a);
ind_b = idx_b(match_a);

return;
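The encoding is only valid under some conditions, so it may be worth guarding them before relying on either function. The sketch below is one possible check, not part of the original answer: it assumes nonnegative integer entries, requires that the largest key stays below `flintmax` (the limit up to which doubles represent integers exactly), and then compares the result with the built-in 'rows' version. The test data and the names `vals`, `base`, `fastRows` and `refRows` are illustrative.

```matlab
rng(0);                                   % fixed seed, for repeatability only
a = randi([0 4], 2000, 5);                % illustrative integer data stored as doubles
b = randi([0 4], 2000, 5);

% Precondition 1: nonnegative integers only
vals = [a(:); b(:)];
assert(all(vals >= 0 & vals == fix(vals)), 'entries must be nonnegative integers');

% Precondition 2: keys must be exactly representable as doubles
base = 10^ceil(log10(1 + max(vals)));     % same base as in the functions above
assert(base^size(a,2) <= flintmax, 'encoded keys would not be exact in double precision');

% Key-based intersection
mult = (base .^ (size(a,2)-1:-1:0)).';
[~, ind_a, ind_b] = intersect(a*mult, b*mult);
fastRows = a(ind_a, :);

% Reference result from the built-in row-wise intersect
refRows = intersect(a, b, 'rows');
disp(isequal(fastRows, refRows));
```

With 5 columns and a power-of-10 base, the second check limits the entries to values below 1000; larger values would need a different key scheme.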

Quick tests and conclusions

With the data sizes listed in the question, the runtimes were:

-------------------------- With original approach
Elapsed time is 3.885792 seconds.
-------------------------- With Proposed approach - Version - I
Elapsed time is 0.581123 seconds.
-------------------------- With Proposed approach - Version - II
Elapsed time is 0.963409 seconds.

The results suggest a clear advantage for version I of the two proposed approaches, with a speedup of around 6.7x over the original approach.
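The benchmark script itself is not shown, so the following is only a sketch of how such a comparison might be set up. It assumes integer-valued data generated with `randi` (the question's `rand` data would not satisfy the integer requirement), reuses one pair of matrices instead of regenerating them in every iteration, and uses a smaller, arbitrary trial count `nTrials`. Timings will differ by machine and MATLAB release.

```matlab
rng(0);                                   % fixed seed, for repeatability only
nTrials = 20;                             % arbitrary; the question used 500 iterations
a = randi([0 9], 10000, 5);
b = randi([0 9], 10000, 5);

% Original approach: intersect with 'rows'
tic;
for k = 1:nTrials
    [r1, ia1, ib1] = intersect(a, b, 'rows');
end
tRows = toc;

% Key-based approach, inlined from version I above
tic;
for k = 1:nTrials
    base = 10^ceil(log10(1 + max([a(:); b(:)])));
    mult = (base .^ (size(a,2)-1:-1:0)).';
    [~, ia2, ib2] = intersect(a*mult, b*mult);
    r2 = a(ia2, :);
end
tKeys = toc;

fprintf('rows: %.3f s, keys: %.3f s\n', tRows, tKeys);
```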

Also, note that if you don't need all three outputs of the original intersect with 'rows' approach, both proposed approaches can be shortened further for better runtime performance.
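For instance, two reduced variants might look like the sketch below. The toy data and the names `commonRows` and `inB` are illustrative, and nonnegative integer entries are assumed as before.

```matlab
a = [1 2 3; 4 5 6; 7 8 8; 4 5 6];
b = [4 5 6; 7 8 8; 0 0 1];

base = 10^ceil(log10(1 + max([a(:); b(:)])));
mult = (base .^ (size(a,2)-1:-1:0)).';

% Variant 1: only the common rows are needed, so ind_b is never requested
[~, ind_a] = intersect(a*mult, b*mult);
commonRows = a(ind_a, :);

% Variant 2: only a logical mask of which rows of a occur in b
% (repeated rows of a stay flagged here, unlike with intersect)
inB = ismember(a*mult, b*mult);
```

The second variant skips the duplicate-removal step of version II entirely, which is only appropriate when repeated rows in `a` should all be reported.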
