5

MATLAB's built-in function accumarray accepts a function fun as a fourth argument.

A = accumarray(subs,val,sz,fun);

This applies fun to each subset of elements in val that have identical subscripts in subs. The documentation however states:

If the subscripts in subs are not sorted with respect to their linear indices, fun should not depend on the order of the values in its input data.

How can we implement a stable version of accumarray, which doesn't have this limitation, but will guarantee that the subsets adopt the same order as given by val?

Example:

subs = [1:10,1:10];
val = 1:20;
accumarray(subs(:), val(:), [], @(x)x(end)).'

The expected output of this would be 11:20 if accumarray were stable. In fact the output is:

ans =
    11    12    13    14     5     6     7    18    19    20

Our implementation should yield:

accumarrayStable(subs(:), val(:), [], @(x)x(end)).'`
ans =
    11    12    13    14    15    16    17    18    19    20
knedlsepp
  • 6,065
  • 3
  • 20
  • 41

1 Answers1

5

We can use sortrows as a preprocessing step to sort the indices and corresponding values first, as its documentation states:

SORTROWS uses a stable version of quicksort.

As the subscripts in subs should be sorted with respect to their linear indices, we need to sort them in reverse lexicographic order. This can be achieved by flipping the column ordering before and after using sortrows.

This gives us the following code for a stable version of accumarray:

function A = accumarrayStable(subs, val, varargin)
[subs(:,end:-1:1), I] = sortrows(subs(:,end:-1:1));
A = accumarray(subs, val(I), varargin{:});

Alternative:

As suggested by Luis Mendo, instead of sortrows one could also generate the linear indices from the subscripts and use sort instead.

function A = accumarrayStable(subs, val, varargin)
if numel(varargin)>0 && ~isempty(varargin{1})
    sz = varargin{1};
else
    sz = max(subs,[],1);
end
[~, I] = sort(subs*cumprod([1,sz(1:end-1)]).');
A = accumarray(subs(I,:), val(I), sz, varargin{:});

Note that we should be using 1+(subs-1)*cumprod([1,sz(1:end-1)]).' for the conversion to linear indices. We leave out the +1 and -1 as the result of sort will still be the same; which saves us a few cycles.

Which one of the above solutions is faster will depend on your machine and MATLAB version. You could test for example via:

A = randi(10, 1e4, 5); 
timeit(@()accumarrayStable(A(:,1:end-1), A(:,end), [], @(x)x(1))
knedlsepp
  • 6,065
  • 3
  • 20
  • 41
  • I was going to suggest `sortrows` regarding [your comment](http://stackoverflow.com/a/28460678/2586922) (it's used in [@Divakar's answer](http://stackoverflow.com/a/28460420/2586922) too). Good Q&A! – Luis Mendo Feb 11 '15 at 20:10
  • @LuisMendo: I figured someone might need this sometime, so why not answer my own question for a change. :-) – knedlsepp Feb 11 '15 at 20:12
  • 1
    Reversing the indices before feeding to `sortrows` is needed so that the sorting is done in the linear-indexing sense, right? Maybe you could add that to your answer – Luis Mendo Feb 11 '15 at 20:13
  • A possibly faster approach would be to explicitly compute the linear indices (which can be done very effficiently with matrix multiplication) and use `sort` instead of `sortrows` – Luis Mendo Feb 11 '15 at 20:34
  • @LuisMendo: I was thinking the same, but then we would need to find out the matrix dimensions first using `max`, which I thought would add more complexity than necessary. I'll give it a try. – knedlsepp Feb 11 '15 at 20:36
  • @LuisMendo: Did you mean along those lines: http://ideone.com/83TSFx ? I haven't done extensive testing yet, but it seems slower than the current answer. (I could get rid of the anonymous function and make a comment instead.) Do you have any suggestions? – knedlsepp Feb 11 '15 at 21:11
  • Yes, anonymous functions are slow. Also, you can get rid of those `-1` and `+1` to gain a little. http://ideone.com/hUNzqV – Luis Mendo Feb 11 '15 at 21:24
  • Oops, I forgot `cumprod` and the user-specified size. But you get the idea. Also, it could be that `reshape` with linear indices (like you do) is faster or slower than using the original subindices... who knows – Luis Mendo Feb 11 '15 at 21:25
  • @LuisMendo: I guess the `sortrows` implementation is quite good. I still get better performance with the initial solution, than with the [updated one](http://ideone.com/H483aa). – knedlsepp Feb 11 '15 at 21:47
  • For me the second version is faster: http://ideone.com/RBFLe7 (R2014b), http://ideone.com/E3QruZ (R2010b). So... maybe post both versions? – Luis Mendo Feb 11 '15 at 22:13
  • 1
    @LuisMendo: I will add the second version. It seems it has something to do that I was testing with `fun = @(x)x(end)`, which for whatever reason slows down version2 a lot more than version 1... We should probably use a test case where `fun~=sum`, otherwise the stable stuff doesn't make any sense anyway. – knedlsepp Feb 11 '15 at 22:25
  • I tend to agree that `sum` should be the reference (it's the default function applied by `accumarray`, and one of the most common). But the "stability" actually makes more sense for `@(x) x(end)` than for `sum`, `max` etc ( where order doesn't matter). For `@(x) x(end)` the gain is smaller, but there's still some: http://ideone.com/njHPfV – Luis Mendo Feb 11 '15 at 22:44
  • @LuisMendo: Ok. I'll just leave both versions and the test; as on my Macbook I consistently get the opposite results `ans = 0.0523` vs. `ans = 0.0583`. Thank you for the contribution! ;-) – knedlsepp Feb 11 '15 at 22:54
  • 1
    Now I see I got your comment about _not_ using `sum` wrong :-) So we agree that stability makes sense for "non-standard" functions – Luis Mendo Feb 11 '15 at 22:57
  • `sortrows` also takes an argument specifying column order, so instead of `[subs(:,end:-1:1), I] = sortrows(subs(:,end:-1:1));` you can use `[subs, I] = sortrows(subs, size(subs,2):-1:1);` – KQS Feb 11 '16 at 21:23