4

I have a matrix of percentage values in which every row represents an individual observation. I need to compute the cumulative (compounded) product of the values that share the same subscript. I tried the accumarray function, which works as expected as long as I pass a column vector of values (rather than a matrix). What is the best way to solve this without looping over the individual columns of my value matrix?

Here's my sample code:

subs = [1;1;1;2;2;2;2;2;3;3;4;4;4];
vals1 = [0.1;0.05;0.2;0.02;0.09;0.3;0.01;0.21;0.12;0.06;0.08;0.12;0.05];

% This is working as expected
result1 = accumarray(subs,vals1, [], @(x) prod(1+x) -1)


vals2 = [vals1,vals1];

% This does not work, because the second input argument of accumarray
% apparently must be a vector (rather than a matrix)
result2 = accumarray(subs, vals2, [], @(x) prod(1+x) -1)
Andi

2 Answers

2

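Instead of passing the values themselves, pass the row indices 1:size(vals2,1) as vals and use them inside the function to extract the matching rows of vals2. Because each group then produces a row vector rather than a scalar, the function has to return a cell.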

result2 = accumarray(subs, 1:size(vals2,1), [], @(x) {prod(1+vals2(x,:),1)-1})

You can concatenate cell elements:

result3 = vertcat(result2{:})

Or all in one line:

result3 = cell2mat( accumarray(subs, 1:size(vals2,1), [], @(x) {prod(1+vals2(x,:),1)-1}))

result3 =

   0.38600   0.38600
   0.76635   0.76635
   0.18720   0.18720
   0.27008   0.27008
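To make the index trick more concrete, here is a minimal sketch using the sample data from the question. The names rowIdx, groups and firstGroup are introduced only for this illustration. The sort call is there because accumarray does not promise any particular order of the elements it hands to the function unless subs is sorted; the product itself does not depend on that order.

```matlab
subs  = [1;1;1;2;2;2;2;2;3;3;4;4;4];
vals1 = [0.1;0.05;0.2;0.02;0.09;0.3;0.01;0.21;0.12;0.06;0.08;0.12;0.05];
vals2 = [vals1, vals1];

% Accumulate row indices instead of values (rowIdx is a hypothetical name)
rowIdx = (1:size(vals2, 1)).';
groups = accumarray(subs, rowIdx, [], @(x) {sort(x)});

% groups{1} holds the rows that belong to subscript 1, i.e. [1;2;3]
% Indexing vals2 with them gives a 3-by-2 block; prod along dimension 1
% compounds each column separately and returns a 1-by-2 row.
firstGroup = prod(1 + vals2(groups{1}, :), 1) - 1;
disp(firstGroup)   % both entries are 1.1*1.05*1.2 - 1 = 0.386
```

This is the same computation the anonymous function in the answer performs for every group; wrapping the row in a cell is what lets accumarray collect non-scalar results.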

Result of a test in Octave comparing three proposed methods using a [10000 x 200] matrix as input:

subs = randi(1000,10000,1);
vals2 = rand(10000,200);

=========CELL2MAT========
Elapsed time is 0.130961 seconds.
=========NDGRID========
Elapsed time is 3.96383 seconds.
=========FOR LOOP========
Elapsed time is 6.16265 seconds.
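The benchmark script itself is not reproduced here, so the following is only a rough sketch of how such a comparison could be set up with tic/toc. It covers two of the three methods, and the variable names nGroups, nRows, nCols, r1, r2 and relErr are assumptions made for this sketch. Unlike the data definition above, the first 1000 entries of subs are fixed to 1:1000 so that every group is guaranteed to occur and both results have the same number of rows. Timings will differ between machines and between MATLAB and Octave.

```matlab
% Synthetic data with the sizes quoted above. Prepending 1:nGroups guarantees
% that every group occurs at least once (an assumption added for this sketch).
nGroups = 1000; nRows = 10000; nCols = 200;
subs  = [(1:nGroups).'; randi(nGroups, nRows - nGroups, 1)];
vals2 = rand(nRows, nCols);

% Method 1: accumarray over row indices, cell output, then cell2mat
tic
r1 = cell2mat(accumarray(subs, (1:nRows).', [], ...
                         @(x) {prod(1 + vals2(x, :), 1) - 1}));
toc

% Method 2: one accumarray call per column
tic
r2 = zeros(nGroups, nCols);
for k = 1:nCols
    r2(:, k) = accumarray(subs, vals2(:, k), [], @(x) prod(1 + x) - 1);
end
toc

% Both methods should agree up to floating-point rounding
relErr = max(abs(r1(:) - r2(:)) ./ max(abs(r1(:)), 1));
```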


rahnema1
  • That doesn't look like an easy solution. It works, but IMHO it makes the accumarray call less readable. In that case, I think I prefer a simple for-loop solution.
    `for i = 1 : size(vals2, 2)` `result2(:,i) = accumarray(subs, vals2(:,i), [], @(x) prod(1+x) -1);` `end`
    – Andi Nov 29 '17 at 13:16
  • A for-loop solution may be inefficient when the number of columns is high. – rahnema1 Nov 29 '17 at 13:27
  • Hmm, in my case, the number of columns can be as high as 10,000. – Andi Nov 29 '17 at 13:40
  • @rahnema1 I think the for-loop solution posted by the OP is not efficient. I implemented an improved for loop without accumarray that performs as fast as the cell2mat version. I modified your script and added the improved for-loop solution. Run it a few times and you will see that the performance is similar to cell2mat. https://rextester.com/RLYKA32361 . Honestly, I am a bit surprised that the simple for loop matches accumarray. Let me know if I am missing something. – Turbo Jan 02 '21 at 10:49
  • @Turbo [Here](https://rextester.com/AFDO33273) I edited the code and used a random data set for `subs`. There are `1500` unique categories, each repeated at most 500 times. I commented out `ndgrid` because it requires a lot of memory. The improved loop is better than the unimproved loop, but it cannot outperform `cell2mat`. It would be better to use `repelem` to create the data set, but that version of Octave doesn't contain `repelem`. – rahnema1 Jan 02 '21 at 15:52
  • Thanks! In short, increasing the size of the subs & vals brings out the difference (between "for loop improved" and "cell2mat"). – Turbo Jan 03 '21 at 00:47
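For reference, the two loop styles discussed in these comments could look roughly like the sketch below, shown on the sample data from the question. The first loop is the per-column accumarray loop from Andi's comment with the output preallocated. The second loop, which iterates over groups and does not call accumarray, is only a guess at what a loop without accumarray might look like; it is not the code behind the rextester links. The names resultCols, resultGroups and nGroups are hypothetical.

```matlab
subs  = [1;1;1;2;2;2;2;2;3;3;4;4;4];
vals1 = [0.1;0.05;0.2;0.02;0.09;0.3;0.01;0.21;0.12;0.06;0.08;0.12;0.05];
vals2 = [vals1, vals1];
nGroups = max(subs);

% Loop over columns, one accumarray call per column (preallocated output)
resultCols = zeros(nGroups, size(vals2, 2));
for i = 1:size(vals2, 2)
    resultCols(:, i) = accumarray(subs, vals2(:, i), [], @(x) prod(1 + x) - 1);
end

% Loop over groups instead (assumed variant, no accumarray):
% logical indexing selects all rows of one group, prod handles all columns
resultGroups = zeros(nGroups, size(vals2, 2));
for g = 1:nGroups
    resultGroups(g, :) = prod(1 + vals2(subs == g, :), 1) - 1;
end
```

The column loop runs once per column, while the group loop runs once per distinct subscript, so which of the two is cheaper presumably depends on whether the data has more columns or more groups.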
0

You need to add a second set of subscripts to subs (so that it is N-by-2) to handle your 2D data, which still has to be passed as an N-element vector (i.e. one element for each row in subs). You can generate the new set of 2D subscripts using ndgrid:

[subs1, subs2] = ndgrid(subs, 1:size(vals2, 2));
result2 = accumarray([subs1(:) subs2(:)], vals2(:), [], @(x) prod(1+x) -1)

And the result with your sample data:

result2 =

    0.3860    0.3860
    0.7664    0.7664
    0.1872    0.1872
    0.2701    0.2701
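As a small illustration of what the ndgrid call produces, the sketch below builds the subscript pairs for the sample data and inspects them. The variable name pairs is introduced here for illustration only. Since there is one (row subscript, column subscript) pair per element of vals2, the index list has as many rows as vals2 has elements, which is worth keeping in mind for large inputs.

```matlab
subs  = [1;1;1;2;2;2;2;2;3;3;4;4;4];
vals1 = [0.1;0.05;0.2;0.02;0.09;0.3;0.01;0.21;0.12;0.06;0.08;0.12;0.05];
vals2 = [vals1, vals1];

[subs1, subs2] = ndgrid(subs, 1:size(vals2, 2));

% subs1 repeats the group subscripts in every column,
% subs2 holds the column number of each element
pairs = [subs1(:), subs2(:)];

disp(size(pairs))             % 26 rows (13 observations x 2 columns), 2 columns
disp(pairs([1 13 14 26], :))  % [1 1; 4 1; 1 2; 4 2]
```

Because subs1(:), subs2(:) and vals2(:) are all unrolled column by column, the k-th pair always describes the k-th element of vals2(:).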
gnovice
  • How would you define the output arguments of `ndgrid` if the number of columns in `vals2` is variable? – Andi Nov 30 '17 at 11:12
  • @Andi: That's exactly what I do in the above code. Note that the second argument to `ndgrid` is a vector from 1 to `size(vals2, 2)` (i.e. the number of columns in `vals2`). – gnovice Nov 30 '17 at 15:48
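To illustrate the point about a variable number of columns, here is the same two lines applied to a three-column matrix. The matrix vals3 and its scaled columns are made up for this example; nothing else changes, because ndgrid always returns exactly two outputs (row subscripts and column subscripts) regardless of how many columns the data has.

```matlab
subs  = [1;1;1;2;2;2;2;2;3;3;4;4;4];
vals1 = [0.1;0.05;0.2;0.02;0.09;0.3;0.01;0.21;0.12;0.06;0.08;0.12;0.05];

% Hypothetical three-column input: the original values, doubled, and halved
vals3 = [vals1, 2*vals1, vals1/2];

% The number of columns only enters through size(vals3, 2)
[subs1, subs2] = ndgrid(subs, 1:size(vals3, 2));
result3 = accumarray([subs1(:) subs2(:)], vals3(:), [], @(x) prod(1 + x) - 1);

disp(size(result3))   % 4 groups by 3 columns
```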