3

What is the fastest way of finding and multiplying repeated values between them in an array?

Example:

a = [ 2 2 3 5 11 11 17 ]

Result:

a = [ 4 3 5 121 17 ]

I can think of iterative ways (by finding the hist, iterating through bins, ...) , but is there a vectorized/fast approach?

Robert Seifert
  • 25,078
  • 11
  • 68
  • 113
Ander Biguri
  • 35,140
  • 11
  • 74
  • 120
  • As I just had a look at the other answers: What is the expected output for `a = [2 2 3 3 2 2]`? Divakar's answer yields `[4,9,4]`, whereas thewaywewalk's answer yields `[16,9]`. – knedlsepp Mar 31 '15 at 15:44
  • 1
    @knedlsepp To be honest, it wont happen. But well, chose the one you prefer. I would say 2nd option fits more in what I was doing, but it doesn't matter. Nice you noticed it though. – Ander Biguri Mar 31 '15 at 15:49

3 Answers3

6

Prospective approach and Solution Code

Seems like the posted problem would be a good fit for accumarray -

%// Starting indices of each "group"
start_ind = find(diff([0 ; a(:)]))

%// Setup IDs for each group
id = zeros(1,numel(a)) %// Or id(numel(a))=0 for faster pre-allocation
id(start_ind) = 1

%// Use accumarray to get the products of elements within the same group
out = accumarray(cumsum(id(:)),a(:),[],@prod)

For a non monotonically increasing input, you need to add two more lines of code -

[~,sorted_idx] = ismember(sort(start_ind),start_ind)
out = out(sorted_idx)

Sample run -

>> a
a =
     2     2     3     5    11    11    17     4     4     1     1     1     7     7
>> out.'
ans =
     4     3     5   121    17    16     1    49

Tweaky-Squeaky

Now, one can make use of logical indexing to remove find and also use the faster pre-allocation scheme to give the proposed approach a super-boost and give us a tweaked code -

id(numel(a))=0;
id([true ; diff(a(:))~=0])=1;
out = accumarray(cumsum(id(:)),a(:),[],@prod);

Benchmarking

Here's the benchmarking code that compares all the proposed approaches posted thus far for the stated problem for runtimes -

%// Setup huge random input array
maxn = 10000;
N = 100000000;
a = sort(randi(maxn,1,N));

%// Warm up tic/toc.
for k = 1:100000
    tic(); elapsed = toc();
end

disp('------------------------- With UNIQUE')
tic
ua = unique(a);
out = ua.^histc(a,ua);
toc, clear ua out

disp('------------------------- With ACCUMARRAY')
tic
id(numel(a))=0;
id([true ; diff(a(:))~=0])=1;
out = accumarray(cumsum(id(:)),a(:),[],@prod);
toc, clear out id

disp('------------------------- With FOR-LOOP')
tic
b = a(1);
for k = 2:numel(a)
    if a(k)==a(k-1)
        b(end) = b(end)*a(k);
    else
        b(end+1) = a(k);
    end
end
toc

Runtimes

------------------------- With UNIQUE
Elapsed time is 3.050523 seconds.
------------------------- With ACCUMARRAY
Elapsed time is 1.710499 seconds.
------------------------- With FOR-LOOP
Elapsed time is 1.811323 seconds.

Conclusions: The runtimes it seems, support the idea of accumarray over the two other approaches!

Community
  • 1
  • 1
Divakar
  • 218,885
  • 19
  • 262
  • 358
  • I was expecting a nice answer like this from you Divakar! – Ander Biguri Mar 30 '15 at 14:18
  • @AnderBiguri Ah me too! ;) – Divakar Mar 30 '15 at 14:18
  • @Divakar: What version of MATLAB are you running? Copy pasting your code I get entirely different results. (Macbook Pro 4GB RAM, R2014a): `unique`: 12.3 seconds, `accumarray`: 9.8 seconds, `for`: 2.9 seconds. – knedlsepp Mar 31 '15 at 18:53
  • It seems `N=1e8` is too much memory for my machine for `accumarray`. `N=1e7` results are more similar, yet still the `for` loop beats `accumarray` here: `unique`: 0.94s, `accumarray`: 0.31s, `for`: 0.26s. – knedlsepp Mar 31 '15 at 18:56
  • @knedlsepp I am on 2015A, 16GB RAM. Make it `N = 1000000` and `maxn = 100`? The idea is to use such datasizes for which tic-toc results in more than 1 sec runtimes, that lessens the effect of variations that come up with using tic-toc. – Divakar Mar 31 '15 at 19:00
  • @Divakar: Ok, this is obviously due to the lower memory overhead of the loop approach then. The slight performance difference of `accumarray` and `for` could be version or hardware related. However it's surprising to me how good the JIT acceleration has become. When I was first taught MATLAB the general consensus was basically: Never ever use a `for`-loop unless you really have to. It's great to see that has changed, as sometimes the loopy approach is so much easier to understand. – knedlsepp Mar 31 '15 at 19:09
  • @knedlsepp Agreed, it's hard to beat loopy codes at times these days! Hardware does play a good role anyway. Well, I wouldn't know what they teach in MATLAB classes, never been to one of those! – Divakar Mar 31 '15 at 19:12
6

Using histc and unique:

ua = unique(a)
out = ua.^histc(a,ua)

out =

     4     3     5   121    17

Considering the case, that the vector a is not monotonically increasing, it gets a little more complicated:

%// non monotonically increasing vector
a = [ 2 2 3 5 11 11 17 4 4 1 1 1 7 7]

[ua, ia] = unique(a)             %// get unique values and sort as required for histc  
[~, idx] = ismember(sort(ia),ia) %// get original order
hc = histc(a,ua)                 %// count occurences
prods = ua.^hc                   %// calculate products
out = prods(idx)                 %// reorder to original order

or:

ua = unique(a,'stable')          %// get unique values in original order
uas = unique(a)                  %// get unique values sorted as required for histc  
[~,idx] = ismember(ua,uas)       %// get indices of original order
hc = histc(a,uas)                %// count occurences
out = ua.^hc(idx)                %// calculate products and reorder 

out =

     4     3     5   121    17    16     1    49

Seems still a good solution as accumarray also doesn't offer a stable version by default.

Community
  • 1
  • 1
Robert Seifert
  • 25,078
  • 11
  • 68
  • 113
3

You might be surprised how well a simple for-loop compares in terms of speed:

b = a(1);
for k = 2:numel(a)
    if a(k)==a(k-1)
        b(end) = b(end)*a(k);
    else
        b(end+1) = a(k);
    end
end

Even without doing any preallocation this performs on par with the accumarray solution.

knedlsepp
  • 6,065
  • 3
  • 20
  • 41