Turning a binary matrix into a vector of the last nonzero index in a fast, vectorized fashion

Question

Suppose, in MATLAB, that I have a matrix, A, whose elements are either 0 or 1.

How do I get a vector of the index of the last non-zero element of each column in a faster, vectorized way?

I could do

[B, I] = max(cumsum(A));

and use I, but is there a faster way? (I'm assuming cumsum would cost a bit of time even suming 0's and 1's).

Edit: I guess that I vectorized even more than I need fast - Mr. Fooz' loop is great but each loop in MATLAB seems to cost me a lot in debugging time even if it is fast.

Mr Fooz · Answer 1 · 2012-01-28T19:40:45.010

11

Fast is what you should worry about, not necessarily full vectorization. Recent versions of Matlab are much smarter about handling loops efficiently. If there's a compact vectorized way of expressing something, it's usually faster, but loops should not (always) be feared like they used to be.

clc

A = rand(5000)>0.5;
A(1,find(sum(A,1)==0)) = 1; % make sure there is at least one match

% Slow because it is doing too much work
tic;[B,I1]=max(cumsum(A));toc

% Fast because FIND is fast and it runs the inner loop
tic;
I3=zeros(1,5000);
for i=1:5000
  I3(i) = find(A(:,i),1,'last');
end
toc;
assert(all(I1==I3));

% Even faster because the JIT in Matlab is smart enough now
tic;
I2=zeros(1,5000);
for i=1:5000
  I2(i) = 0;
  for j=5000:-1:1
    if A(j,i)
      I2(i) = j;
      break;
    end
  end
end
toc;
assert(all(I1==I2));

On R2008a, Windows, x64, the cumsum version takes 0.9 seconds. The loop and find version takes 0.02 seconds. The double loop version takes a mere 0.001 seconds.

EDIT: Which one is fastest depends on the actual data. The double-loop takes 0.05 seconds when you change the 0.5 to 0.999 (because it takes longer to hit the break; on average). cumsum and the loop&find implementation have more consistent speeds.

EDIT 2: gnovice's flipud solution is clever. Unfortunately, on my test machine it takes 0.1 seconds, so it's much faster than cumsum, but slower than the looped versions.

edited Jan 28 '12 at 19:40

answered May 06 '09 at 21:52

Mr Fooz

109,094
6
73
101

Wow, are qualities of your loop that make it faster or would any such loop be the fastest way to do any similar operation? – Joe Soul-bringer May 07 '09 at 04:45
1

When I started writing the examples, I expected the double-loop to be slowest and loop&find to be fastest. When the inner loop has to run to completion, it is somewhat slow (see edit 2). These days, Matlab does a just-in-time compilation of every function. This makes loops much faster (but has some unexpected consequences for people who like to use EVAL). In general, vectorization is still better to use if you can do it without doing extra work (the flipud and cumsum solutions are vectorized but do extra work). – Mr Fooz May 07 '09 at 13:46
Another interesting thing to note is that in many cases recent versions of Matlab are smart about expression parsing. E=A.*B.*C.*D; will be executed with no extra temporaries, element-by-element, similar to the way you would do it if manually writing the operation in C. With multicore support enabled, Matlab tries to find disjoint parts of the operation and farm them out to different CPU cores. I don't know if it's smart enough to divide up independent loop iterations amongst the calls. For the tests I did, I used a Core 2 Duo proc. – Mr Fooz May 07 '09 at 13:49
On a multicore machine, you may even be able to incorporate a PARFOR (parallel for loop) for your outer loop, since the inner loop operates independently on each column. – gnovice May 07 '09 at 15:43
I tried parfor, as suggested by gnovice. The first time it took nearly a second (probably because it had to load in parfor logic and/or create the worker thread pool). After a few runs, it drops down to 0.04 seconds. So at least on a simple loop like this one with only 5000 iterations on a mere dual-core processor, the serial version is much faster. – Mr Fooz May 07 '09 at 19:58
Hmmm... I guess the work being done in the loop is so minimal that it is dwarfed by the extra overhead incurred from parallelization. – gnovice May 08 '09 at 14:47
Agreed. Micro parallelization often doesn't work well on CPUs. – Mr Fooz May 08 '09 at 15:46

score 7 · Accepted Answer · edited May 23 '17 at 10:31

As shown by Mr Fooz, for loops can be pretty fast now with newer versions of MATLAB. However, if you really want to have compact vectorized code, I would suggest trying this:

[B,I] = max(flipud(A));
I = size(A,1)-I+1;

This is faster than your CUMSUM-based answer, but still not quite as fast as Mr Fooz's looping options.

Two additional things to consider:

What results do you want to get for a column that has no ones in it at all? With the above option I gave you, I believe you will get an index of size(A,1) (i.e. the number of rows in A) in such a case. For your option, I believe you will get a 1 in such a case, while the nested-for-loops option from Mr Fooz will give you a 0.
The relative speed of these different options will likely vary based on the size of A and the number of non-zeroes you expect it to have.

Clever idea. Unfortunately, it's about 5x slower than loop&find. — Mr Fooz, May 06 '09 at 23:24
That's kinda the result I expected: faster than CUMSUM but still slower than looping... although it's all still dependent on the size and fill-fraction of A (which the OP didn't really define). — gnovice, May 07 '09 at 00:03

Turning a binary matrix into a vector of the last nonzero index in a fast, vectorized fashion

2 Answers2

Linked