7

I have 2 big arrays A and b:

A: 10.000++ rows, 4 columns, not unique integers
b: vector with 500.000++ elements, unique integers

Due to the uniqueness of the values of b, I need to find the only index of b, where A(i,j) == b.

What I started with is

[rows,columns] = size(A);
B = zeros(rows,columns);
for i = 1 : rows
    for j = 1 : columns
        B(i,j) = find(A(i,j)==b,1);
    end
end

This takes approx 5.5 seconds to compute, which is way to long, since A and b can be significantly bigger... That in mind I tried to speed up the code by using logical indexing and reducing the for-loops

[rows,columns] = size(A);
B = zeros(rows,columns);
for idx = 1 : numel(b)
    B(A==b(idx)) = idx;
end

Sadly this takes even longer: 21 seconds

I even tried to do use bsxfun

for i = 1 : columns
   [I,J] = find(bsxfun(@eq,A(:,i),b))
    ... stitch B together ...
end

but with a bigger arrays the maximum array size is quickly exceeded (102,9GB...).

Can you help me find a faster solution to this? Thanks in advance!

EDIT: I extended find(A(i,j)==b,1), which speeds up the algorithm by factor 2! Thank you, but overall still too slow... ;)

Cris Luengo
  • 55,762
  • 10
  • 62
  • 120
Johannes
  • 135
  • 5

2 Answers2

4

The function ismember is the right tool for this:

[~,B] = ismember(A,b);

Test code:

function so
  A = rand(1000,4);
  b = unique([A(:);rand(2000,1)]);

  B1 = op1(A,b);
  B2 = op2(A,b);
  isequal(B1,B2)

  tic;op1(A,b);op1(A,b);op1(A,b);op1(A,b);toc
  tic;op2(A,b);op2(A,b);op2(A,b);op2(A,b);toc
end

function B = op1(A,b)
  B = zeros(size(A));
  for i = 1:numel(A)
    B(i) = find(A(i)==b,1);
  end
end

function B = op2(A,b)
  [~,B] = ismember(A,b);
end

I ran this on Octave, which is not as fast with loops as MATLAB. It also doesn't have the timeit function, hence the crappy timing using tic/toc (sorry for that). In Octave, op2 is more than 100 times faster than op1. Timings will be different in MATLAB, but ismember should still be the fastest option. (Note I also replaced your double loop with a single loop, this is the same but simpler and probably faster.)

If you want to repeatedly do the search in b, it is worthwhile to sort b first, and implement your own binary search. This will avoid the checks and sorting that ismember does. See this other question.

Cris Luengo
  • 55,762
  • 10
  • 62
  • 120
  • 4
    It should be noted that the use of [`ismember`](https://www.mathworks.com/help/matlab/ref/ismembertol.html) comes with the obligatory floating point comparison caveat. [`ismembertol`](https://www.mathworks.com/help/matlab/ref/ismembertol.html) was introduced in R2015a to mitigate these types of issues. – sco1 May 31 '18 at 18:56
2

Assuming that you have positive integers you can use array indexing:

mm = max(max(A(:)),max(b(:)));
idxs = sparse(b,1,1:numel(b),mm,1);
result = full(idxs(A));

If the range of values is small you can use dense matrix instead of sparse matrix:

mm = max(max(A(:)),max(b(:)));
idx = zeros(mm,1);
idx(b)=1:numel(b);
result = idx(A);
rahnema1
  • 15,264
  • 3
  • 15
  • 27