Fastest way of finding and multiplying repeated values in array

Question

What is the fastest way of finding and multiplying repeated values between them in an array?

Example:

a = [ 2 2 3 5 11 11 17 ]

Result:

a = [ 4 3 5 121 17 ]

I can think of iterative ways (by finding the hist, iterating through bins, ...) , but is there a vectorized/fast approach?

As I just had a look at the other answers: What is the expected output for `a = [2 2 3 3 2 2]`? Divakar's answer yields `[4,9,4]`, whereas thewaywewalk's answer yields `[16,9]`. — knedlsepp, Mar 31 '15 at 15:44
@knedlsepp To be honest, it wont happen. But well, chose the one you prefer. I would say 2nd option fits more in what I was doing, but it doesn't matter. Nice you noticed it though. — Ander Biguri, Mar 31 '15 at 15:49

score 6 · Answer 1 · edited Jun 20 '20 at 09:12

Prospective approach and Solution Code

Seems like the posted problem would be a good fit for accumarray -

%// Starting indices of each "group"
start_ind = find(diff([0 ; a(:)]))

%// Setup IDs for each group
id = zeros(1,numel(a)) %// Or id(numel(a))=0 for faster pre-allocation
id(start_ind) = 1

%// Use accumarray to get the products of elements within the same group
out = accumarray(cumsum(id(:)),a(:),[],@prod)

For a non monotonically increasing input, you need to add two more lines of code -

[~,sorted_idx] = ismember(sort(start_ind),start_ind)
out = out(sorted_idx)

Sample run -

>> a
a =
     2     2     3     5    11    11    17     4     4     1     1     1     7     7
>> out.'
ans =
     4     3     5   121    17    16     1    49

Tweaky-Squeaky

Now, one can make use of logical indexing to remove find and also use the faster pre-allocation scheme to give the proposed approach a super-boost and give us a tweaked code -

id(numel(a))=0;
id([true ; diff(a(:))~=0])=1;
out = accumarray(cumsum(id(:)),a(:),[],@prod);

Benchmarking

Here's the benchmarking code that compares all the proposed approaches posted thus far for the stated problem for runtimes -

%// Setup huge random input array
maxn = 10000;
N = 100000000;
a = sort(randi(maxn,1,N));

%// Warm up tic/toc.
for k = 1:100000
    tic(); elapsed = toc();
end

disp('------------------------- With UNIQUE')
tic
ua = unique(a);
out = ua.^histc(a,ua);
toc, clear ua out

disp('------------------------- With ACCUMARRAY')
tic
id(numel(a))=0;
id([true ; diff(a(:))~=0])=1;
out = accumarray(cumsum(id(:)),a(:),[],@prod);
toc, clear out id

disp('------------------------- With FOR-LOOP')
tic
b = a(1);
for k = 2:numel(a)
    if a(k)==a(k-1)
        b(end) = b(end)*a(k);
    else
        b(end+1) = a(k);
    end
end
toc

Runtimes

------------------------- With UNIQUE
Elapsed time is 3.050523 seconds.
------------------------- With ACCUMARRAY
Elapsed time is 1.710499 seconds.
------------------------- With FOR-LOOP
Elapsed time is 1.811323 seconds.

Conclusions: The runtimes it seems, support the idea of accumarray over the two other approaches!

@Divakar: What version of MATLAB are you running? Copy pasting your code I get entirely different results. (Macbook Pro 4GB RAM, R2014a): `unique`: 12.3 seconds, `accumarray`: 9.8 seconds, `for`: 2.9 seconds. — knedlsepp, Mar 31 '15 at 18:53
It seems `N=1e8` is too much memory for my machine for `accumarray`. `N=1e7` results are more similar, yet still the `for` loop beats `accumarray` here: `unique`: 0.94s, `accumarray`: 0.31s, `for`: 0.26s. — knedlsepp, Mar 31 '15 at 18:56
@knedlsepp I am on 2015A, 16GB RAM. Make it `N = 1000000` and `maxn = 100`? The idea is to use such datasizes for which tic-toc results in more than 1 sec runtimes, that lessens the effect of variations that come up with using tic-toc. — Divakar, Mar 31 '15 at 19:00
@Divakar: Ok, this is obviously due to the lower memory overhead of the loop approach then. The slight performance difference of `accumarray` and `for` could be version or hardware related. However it's surprising to me how good the JIT acceleration has become. When I was first taught MATLAB the general consensus was basically: Never ever use a `for`-loop unless you really have to. It's great to see that has changed, as sometimes the loopy approach is so much easier to understand. — knedlsepp, Mar 31 '15 at 19:09
@knedlsepp Agreed, it's hard to beat loopy codes at times these days! Hardware does play a good role anyway. Well, I wouldn't know what they teach in MATLAB classes, never been to one of those! — Divakar, Mar 31 '15 at 19:12

score 6 · Accepted Answer · edited May 23 '17 at 12:22

6

Using histc and unique:

ua = unique(a)
out = ua.^histc(a,ua)

out =

     4     3     5   121    17

Considering the case, that the vector a is not monotonically increasing, it gets a little more complicated:

%// non monotonically increasing vector
a = [ 2 2 3 5 11 11 17 4 4 1 1 1 7 7]

[ua, ia] = unique(a)             %// get unique values and sort as required for histc  
[~, idx] = ismember(sort(ia),ia) %// get original order
hc = histc(a,ua)                 %// count occurences
prods = ua.^hc                   %// calculate products
out = prods(idx)                 %// reorder to original order

or:

ua = unique(a,'stable')          %// get unique values in original order
uas = unique(a)                  %// get unique values sorted as required for histc  
[~,idx] = ismember(ua,uas)       %// get indices of original order
hc = histc(a,uas)                %// count occurences
out = ua.^hc(idx)                %// calculate products and reorder

out =

     4     3     5   121    17    16     1    49

Seems still a good solution as accumarray also doesn't offer a stable version by default.

edited May 23 '17 at 12:22

Community

1
1

answered Mar 30 '15 at 14:12

Robert Seifert

25,078
11
68
113

OH, well, Divakar's answer is cool, but this one is short and clear! Very nice. – Ander Biguri Mar 30 '15 at 14:15
2

I am going to accept this answer for the length and clarity of the code, however both of them are really good. – Ander Biguri Mar 30 '15 at 14:20
@AnderBiguri Have a look at my edit, I don't know if it applies to your case. – Robert Seifert Mar 30 '15 at 14:25
Very nice! In my case they are monotonically increasing, but this is very nice for the community. – Ander Biguri Mar 30 '15 at 14:26
Could you check the output for `a = [ 2 2 3 5 11 11 17 4 4 1 1 1 7 7]`? – Divakar Mar 30 '15 at 15:11
1

@Divakar thanks for your advice! I misused sort.... I stole your ismember approach now. My upvote you already had anyway :) – Robert Seifert Mar 30 '15 at 15:39
The non-monotonic version doesn't work for me with `a = randi(30,1000,1);` Haven't found the bug yet though. – knedlsepp Mar 31 '15 at 15:16
1

Oh. I guess the problem is the following: Multiple non-consecutive occurrences of the same integer: `a = [2 2 3 3 2 2]`. – knedlsepp Mar 31 '15 at 15:19
@knedlsepp that is true. that won't work, but is also far from the OPs request. – Robert Seifert Apr 01 '15 at 11:13
@thewaywewalk: Yup, he clarified that this case won't occur, so it doesn't matter. In this sense it is not a bug, but rather one possible interpretation of the problem. – knedlsepp Apr 01 '15 at 12:24

knedlsepp · Answer 3 · 2015-03-31T19:00:29.353

3

You might be surprised how well a simple for-loop compares in terms of speed:

b = a(1);
for k = 2:numel(a)
    if a(k)==a(k-1)
        b(end) = b(end)*a(k);
    else
        b(end+1) = a(k);
    end
end

Even without doing any preallocation this performs on par with the accumarray solution.

edited Mar 31 '15 at 19:00

answered Mar 31 '15 at 15:22

knedlsepp

6,065
3
20
41

Added few tweaks and benchmarks for more surprises :) Good alternative solution though! – Divakar Mar 31 '15 at 18:45

Fastest way of finding and multiplying repeated values in array

3 Answers3

Prospective approach and Solution Code

Tweaky-Squeaky

Benchmarking