
I commonly need to summarize an irregularly timed time series with a given aggregation function (e.g., sum, average). However, my current solution seems inefficient and slow.

Take the aggregation function:

function aggArray = aggregate(array, groupIndex, collapseFn)

groups = unique(groupIndex, 'rows');
aggArray = nan(size(groups, 1), size(array, 2));

for iGr = 1:size(groups,1)
    grIdx = all(groupIndex == repmat(groups(iGr,:), [size(groupIndex,1), 1]), 2);
    for iSer = 1:size(array, 2)
      aggArray(iGr,iSer) = collapseFn(array(grIdx,iSer));
    end
end

end

Note that both array and groupIndex can be 2D. Every column in array is an independent series to be aggregated, but the columns of groupIndex should be taken together (as a row) to specify a period.
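
To pin down the semantics, here is a direct (unoptimized) NumPy translation of the loop above, with hypothetical toy data — a sketch for readers outside MATLAB, not part of the benchmarks:

```python
import numpy as np

def aggregate(array, group_index, collapse_fn):
    """Direct translation of the MATLAB loop above (illustrative, not fast)."""
    # Like unique(groupIndex, 'rows'): one entry per distinct period
    groups = np.unique(group_index, axis=0)
    agg = np.full((groups.shape[0], array.shape[1]), np.nan)
    for i, g in enumerate(groups):
        rows = (group_index == g).all(axis=1)   # the grIdx mask
        for j in range(array.shape[1]):         # the iSer loop
            agg[i, j] = collapse_fn(array[rows, j])
    return agg

a = np.arange(12.0).reshape(6, 2)               # two series, six samples
b = np.array([[1], [1], [2], [2], [2], [3]])    # three periods of length 2, 3, 1
print(aggregate(a, b, np.sum))                  # one row per period
```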

When we feed it an irregular time series (note that the second period is one base period longer than the first), the timing results are poor:

a = rand(20006,10);
b = transpose([ones(1,5) 2*ones(1,6) sort(repmat((3:4001), [1 5]))]);

tic; aggregate(a, b, @sum); toc
Elapsed time is 1.370001 seconds.

Using the profiler, we can find out that the grIdx line takes about 1/4 of the execution time (0.28 s) and the iSer loop takes about 3/4 (1.17 s) of the total (1.48 s).

Compare this with the period-indifferent case:

tic; cumsum(a); toc
Elapsed time is 0.000930 seconds.

Is there a more efficient way to aggregate this data?


Timing Results

Taking each response and putting it in a separate function, here are the timing results I get with timeit in MATLAB R2015b on Windows 7 with an Intel i7:

    original | 1.32451
      felix1 | 0.35446
      felix2 | 0.16432
    divakar1 | 0.41905
    divakar2 | 0.30509
    divakar3 | 0.16738
matthewGunn1 | 0.02678
matthewGunn2 | 0.01977

Clarification on groupIndex

An example of a 2D groupIndex would be where both the year number and week number are specified for a set of daily data covering 1980-2015:

a2 = rand(36*52*5, 10);
b2 = [sort(repmat(1980:2015, [1 52*5]))' repmat(1:52, [1 36*5])'];

Thus a "year-week" period is uniquely identified by a row of groupIndex. This is effectively handled by calling unique(groupIndex, 'rows') and taking the third output, so feel free to disregard this portion of the question.

David Kelley
  • For each group, your code has to do a bunch of crap that's O(n) where n is size of whole data matrix. The line `grIdx = all(groupIndex == repmat(groups(iGr,:), [size(groupIndex,1), 1]), 2);` isn't going to be fast. I struggled with a similar problem: I had a matrix of data and a column vector signifying which group a row (of the data matrix) was a member of. For each group, I wanted to pull the group's data and do some calculations. I ended up writing a mex function in c++ that returned a cell array showing which group had data on which rows. – Matthew Gunn Nov 10 '15 at 18:33
  • If groupIndex is just a column vector, there's possibly some mex c++ code I could post that you might find useful. It takes a groupIndex vector and for each group, shows what rows of groupIndex that group is on. – Matthew Gunn Nov 10 '15 at 18:50
  • @MatthewGunn It'd be a start. But it won't replace the inner for-loop though, will it? I see the `grIdx` line as a definite part of the problem, but a good chunk of the execution time is spent on the `iSer` loop. – David Kelley Nov 10 '15 at 19:19
  • As long as each group has at least two observations, you could possibly replace that with: aggArray(iGr,:) = collapseFn(array(grIdx,:)) That would work with a lot of collapse functions like mean etc... but yeah, it's not as robust – Matthew Gunn Nov 10 '15 at 19:35
  • I was doing that until I started getting weird errors and added that. Might be worth adding an if statement for that. I'll have to check what the profiler says. – David Kelley Nov 10 '15 at 19:46
  • If your goal is to get something competitive with tic; sum(a); toc it's just not going to happen. (1) the algorithmic complexity of what you're trying to do is WAY higher because you need to essentially sort array by groupIndex! and (2) unless you write everything in c++, you're going to be hitting a lot of overhead associated with MATLAB as an interpreted language. – Matthew Gunn Nov 12 '15 at 23:29
  • What's the sizes of the input datasets? Would you be kind enough to benchmark the posted approaches? – Divakar Nov 13 '15 at 11:25
  • @Divakar: Benchmarks coming shortly. And I gave a 20000 element dataset as an example, but I'm just as interested in smaller arrays between 200-1000. – David Kelley Nov 13 '15 at 15:03
  • @DavidKelley Oops, I just made an edit in my solution with a possible performance improvement suggestion in it! Could you add that in the benchmarks? – Divakar Nov 13 '15 at 15:06
  • @Divakar Added as a third response for you by altering the second response. – David Kelley Nov 13 '15 at 16:05
  • @DavidKelley Thanks a lot! – Divakar Nov 13 '15 at 18:38
  • @DavidKelley Thanks a lot on the tests! Question for you - So, you are using `b` as `groupIndex`, which in your testcase is a 1D array, right? So that makes `unique(groupIndex, 'rows')` essentially same as `unique(groupIndex)`, right? In your problem, it says `groupIndex can be 2D`, so wouldn't it be fair to have a regular 2D array as `groupIndex` for benchmarking? – Divakar Nov 13 '15 at 18:54
  • @David Kelley to expand on Divakar's question, I assume the group index can be different for different columns of the array, but all columns of the group index would have the same number of groups, just in different order, e.g. like groupIndex=[1 1 2 3;1 2 2 3; 1 2 3 3]' for e.g. 3 columns? – Felix Darvas Nov 13 '15 at 19:09
  • See the edits I just made. I'm pretty sure it's immaterial at this point. – David Kelley Nov 13 '15 at 21:03
  • Would it be possible for you to share the benchmarking codes? – Divakar Nov 14 '15 at 10:28

5 Answers


Method #1

You can create the mask corresponding to grIdx across all groups in one go with bsxfun(@eq,..). Now, for collapseFn as @sum, you can bring in matrix-multiplication and thus have a completely vectorized approach, like so -

M = squeeze(all(bsxfun(@eq,groupIndex,permute(groups,[3 2 1])),2))
aggArray = M.'*array

For collapseFn as @mean, you need to do a bit more work, as shown here -

M = squeeze(all(bsxfun(@eq,groupIndex,permute(groups,[3 2 1])),2))
aggArray = bsxfun(@rdivide,M,sum(M,1)).'*array
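
For readers outside MATLAB, the same matrix-multiplication trick can be sketched in NumPy, with hypothetical toy data; np.unique with return_inverse stands in for the groups/mask construction:

```python
import numpy as np

a = np.arange(12.0).reshape(6, 2)                 # toy data: two series
b = np.array([1, 1, 2, 2, 2, 3])                  # 1-D group labels for brevity

groups, inv = np.unique(b, return_inverse=True)   # inv: group id per row
M = np.zeros((len(b), len(groups)))               # one-hot membership mask
M[np.arange(len(b)), inv] = 1.0

agg_sum = M.T @ a                                 # matrix multiply = per-group sums
agg_mean = (M / M.sum(axis=0)).T @ a              # row-normalized mask gives means
print(agg_sum)
```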

Method #2

In case you are working with a generic collapseFn, you can use the 2D mask M created with the previous method to index into the rows of array, thus changing the complexity from O(n^2) to O(n). Some quick tests suggest this gives an appreciable speedup over the original loopy code. Here's the implementation -

n = size(groups,1);
M = squeeze(all(bsxfun(@eq,groupIndex,permute(groups,[3 2 1])),2));
out = zeros(n,size(array,2));
for iGr = 1:n
    out(iGr,:) = collapseFn(array(M(:,iGr),:),1);
end

Please note that the 1 in collapseFn(array(M(:,iGr),:),1) denotes the dimension along which collapseFn is applied, so the 1 is essential there.


Bonus

By its name, groupIndex seems like it would hold integer values, which we can exploit to build M more efficiently: treat each row of groupIndex as an indexing tuple, convert it to a scalar, and thereby reduce groupIndex to a 1D array. This should be more efficient, as the data size is now O(n). The resulting M can be fed to any of the approaches listed in this post. So, we would build M like so -

dims = max(groupIndex,[],1);
agg_dims = cumprod([1 dims(end:-1:2)]);
[~,~,idx] = unique(groupIndex*agg_dims(end:-1:1).');

m = size(groupIndex,1);
M = false(m,max(idx));
M((idx-1)*m + [1:m]') = 1;
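
The rows-to-scalar-key idea has a direct NumPy counterpart in np.ravel_multi_index; a sketch with hypothetical toy data, under the same assumption of integer-valued labels:

```python
import numpy as np

# Toy 2-D group index: (year, week)-style rows
group_index = np.array([[1980, 1], [1980, 1], [1980, 2], [1981, 1]])

# Shift each column to start at 0, then collapse each row to one scalar key
shifted = group_index - group_index.min(axis=0)
dims = shifted.max(axis=0) + 1
keys = np.ravel_multi_index(shifted.T, dims)      # one integer per row

_, idx = np.unique(keys, return_inverse=True)     # 0-based group id per row
m = len(keys)
M = np.zeros((m, idx.max() + 1), dtype=bool)      # the mask M from above
M[np.arange(m), idx] = True
print(M.astype(int))
```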
Divakar

Mex Function 1

HAMMER TIME: a MEX function to crush it. The base-case test with the original code from the question took 1.334139 seconds on my machine. IMHO, the 2nd-fastest answer, from @Divakar, is:

groups2 = unique(groupIndex); 
aggArray2 = squeeze(all(bsxfun(@eq,groupIndex,permute(groups2,[3 2 1])),2)).'*array; 

Elapsed time is 0.589330 seconds.

Then my MEX function:

[groups3, aggArray3] = mg_aggregate(array, groupIndex, @(x) sum(x, 1));

Elapsed time is 0.079725 seconds.

Testing that we get the same answer: norm(groups2-groups3) returns 0 and norm(aggArray2 - aggArray3) returns 2.3959e-15. Results also match original code.

Code to generate the test conditions:

array = rand(20006,10);
groupIndex = transpose([ones(1,5) 2*ones(1,6) sort(repmat((3:4001), [1 5]))]);

For pure speed, go MEX. If the thought of compiling C++ code / the complexity is too much of a pain, go with Divakar's answer. Another disclaimer: I haven't subjected my function to robust testing.

Mex Approach 2

Somewhat surprisingly to me, this code appears even faster than the full MEX version in some cases (e.g., this test took about 0.05 seconds). It uses a MEX function mg_getRowsWithKey to figure out the indices of groups. I think it may be because my array copying in the full MEX function isn't as fast as it could be and/or the overhead from calling 'feval'. It's basically the same algorithmic complexity as the other version.

[unique_groups, map] = mg_getRowsWithKey(groupIndex);

results = zeros(length(unique_groups), size(array,2));

for iGr = 1:length(unique_groups)
   array_subset             = array(map{iGr},:);

   %// do your collapse function on array_subset. eg.
   results(iGr,:)           = sum(array_subset, 1);
end

When you do array(groups(1)==groupIndex,:) to pull out array entries associated with the full group, you're searching through the ENTIRE length of groupIndex. If you have millions of row entries, this will totally suck. array(map{1},:) is far more efficient.
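
The row-map idea is language-agnostic; here's a sketch of the same bookkeeping in Python, where a dict plays the role of the C++ STL map (hypothetical toy data and function name, not the actual MEX code):

```python
import numpy as np
from collections import defaultdict

def rows_with_key(group_index):
    """One pass over the rows: record which row numbers each group key occupies."""
    rows = defaultdict(list)
    for i, key in enumerate(map(tuple, group_index)):
        rows[key].append(i)
    return rows

a = np.arange(12.0).reshape(6, 2)               # toy data: two series
b = np.array([[1], [1], [2], [2], [2], [3]])    # toy group index

group_rows = rows_with_key(b)
# Pull each group's rows directly instead of re-scanning all of b per group
results = np.array([a[r].sum(axis=0) for r in group_rows.values()])
print(results)
```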


There's still unnecessary copying of memory and other overhead associated with calling 'feval' on the collapse function. If you implement the aggregator function efficiently in c++ in such a way to avoid copying of memory, probably another 2x speedup can be achieved.

Matthew Gunn
  • An attribution for the `2nd fastest answer` would be nice I think. – Divakar Nov 13 '15 at 11:48
  • How do you do that? Just @? – Matthew Gunn Nov 13 '15 at 11:51
  • Yeah, like in [`this solution`](http://stackoverflow.com/a/28118804/3293881) . Sorry to be noisy about it. So, that `mex` would cover any generic `collapse_fn`, right? – Divakar Nov 13 '15 at 11:56
  • You need to have the @(x) sum(x, 1) in there to get the dimensions right. Yeah, it should take any collapse function that returns a double array of the right dimensions. It's called with: `rhs[0] = const_cast(collapse_fn); rhs[1] = group_j_array; mexCallMATLAB(1,&lhs,2,rhs,"feval");` – Matthew Gunn Nov 13 '15 at 11:58
  • I haven't tested it a ton, so there's a chance there are bugs, but I think I got it all right. – Matthew Gunn Nov 13 '15 at 11:59
  • @Divakar thanks! You too. I had no idea until glancing at some of your answers how powerful bsxfun is, esp. when combined with collapse, permute, etc... – Matthew Gunn Nov 13 '15 at 12:25
  • Yeah!! I have been living off of those selective functions with MATLAB on Stackoverflow! :) – Divakar Nov 13 '15 at 12:27
  • @Divakar is the master at `bsxfun/permute`. His answers using pure MATLAB syntax (i.e. just using `bsxfun/permute` etc.) are amongst the fastest I've seen... as competitive as writing your own MEX code. – rayryeng Nov 13 '15 at 19:45
  • Neat alternative for [`accumarray`](http://se.mathworks.com/help/matlab/ref/accumarray.html) for the matrix case. And I honestly hadn't considered you could pass anonymous functions to MEX! – mikkola Nov 24 '15 at 07:23
  • @mikkola lol, yeah. I don't think it gains you anything, but it's kinda cool. Basically ALL the performance gain here is from using STL map datatype to keep track of which group is on which rows. – Matthew Gunn Nov 24 '15 at 07:32

A little late to the party, but a single loop using accumarray makes a huge difference:

function aggArray = aggregate_gnovice(array, groupIndex, collapseFn)

  [groups, ~, index] = unique(groupIndex, 'rows');
  numCols = size(array, 2);
  aggArray = nan(size(groups, 1), numCols);
  for col = 1:numCols
    aggArray(:, col) = accumarray(index, array(:, col), [], collapseFn);
  end

end

Timing this using timeit in MATLAB R2016b for the sample data in the question gives the following:

original | 1.127141
 gnovice | 0.002205

Over a 500x speedup!
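
For comparison, NumPy's nearest analogue to accumarray for sums is np.bincount with weights; a sketch of the same one-loop-over-columns structure, with hypothetical toy data rather than the benchmark inputs:

```python
import numpy as np

a = np.arange(12.0).reshape(6, 2)                  # toy data: two series
b = np.array([[1], [1], [2], [2], [2], [3]])       # toy group index

# Like [~, ~, index] = unique(groupIndex, 'rows'): a 0-based group id per row
index = np.unique(b, axis=0, return_inverse=True)[1].ravel()

n_groups = index.max() + 1
agg = np.empty((n_groups, a.shape[1]))
for col in range(a.shape[1]):
    # bincount sums a[:, col] into bins selected by index (accumarray analogue)
    agg[:, col] = np.bincount(index, weights=a[:, col], minlength=n_groups)
print(agg)
```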

gnovice

Doing away with the inner loop, i.e.

function aggArray = aggregate(array, groupIndex, collapseFn)

groups = unique(groupIndex, 'rows');
aggArray = nan(size(groups, 1), size(array, 2));

for iGr = 1:size(groups,1)
    grIdx = all(groupIndex == repmat(groups(iGr,:), [size(groupIndex,1), 1]), 2);
    aggArray(iGr,:) = collapseFn(array(grIdx,:));
end

end

and calling the collapse function with a dimension parameter

res=aggregate(a, b, @(x)sum(x,1));

gives some speedup (3x on my machine) already and avoids the errors that, e.g., sum or mean produce when they encounter a single row of data without a dimension argument and then collapse across columns rather than groups.

If you had just one group label vector, i.e. the same group labels for all columns of data, you could speed things up further:

function aggArray = aggregate(array, groupIndex, collapseFn)

ng=max(groupIndex);
aggArray = nan(ng, size(array, 2));

for iGr = 1:ng
    aggArray(iGr,:) = collapseFn(array(groupIndex==iGr,:));
end

The latter function gives identical results for your example, with a 6x speedup, but cannot handle different group labels per data column.

Assuming a 2D test case for the group index (provided here with 10 different columns for groupIndex):

a = rand(20006,10);
B=[]; % make random length periods for each of the 10 signals
for i=1:size(a,2)
      n0=randi(10);
      b=transpose([ones(1,n0) 2*ones(1,11-n0) sort(repmat((3:4001), [1 5]))]);
      B=[B b];
end
tic; erg0=aggregate(a, B, @sum); toc % original method 
tic; erg1=aggregate2(a, B, @(x)sum(x,1)); toc %just remove the inner loop
tic; erg2=aggregate3(a, B, @(x)sum(x,1)); toc %use function below

Elapsed time is 2.646297 seconds.
Elapsed time is 1.214365 seconds.
Elapsed time is 0.039678 seconds (!!!!).

function aggArray = aggregate3(array, groupIndex, collapseFn)

[groups, ix1] = unique(groupIndex, 'rows', 'first');
[~, ix2] = unique(groupIndex, 'rows', 'last');

ng=size(groups,1);
aggArray = nan(ng, size(array, 2));

for iGr = 1:ng
    aggArray(iGr,:) = collapseFn(array(ix1(iGr):ix2(iGr),:));
end

I think this is as fast as it gets without using MEX. Thanks to Matthew Gunn's suggestion! Profiling shows that unique is really cheap here, and getting just the first and last index of the repeating rows in groupIndex speeds things up considerably. I get an 88x speedup with this iteration of the aggregation.
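
The first/last-index trick carries over to NumPy almost verbatim: np.unique(..., return_index=True) yields the run starts and np.add.reduceat sums each contiguous block. Like aggregate3, this sketch (with hypothetical toy data) assumes the group labels are sorted so each group occupies a contiguous run:

```python
import numpy as np

a = np.arange(12.0).reshape(6, 2)        # toy data: two series
b = np.array([1, 1, 2, 2, 2, 3])         # sorted labels: groups are contiguous runs

_, starts = np.unique(b, return_index=True)   # first row of each run
agg = np.add.reduceat(a, starts, axis=0)      # sum rows starts[i]:starts[i+1]
print(agg)
```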

Felix Darvas
  • Ooops, I almost missed this reading the code... An IMPORTANT subtle/clever point is that Felix passed @(x) sum(x,1) as the aggregation function! If you don't do that, BAD things can happen! This code: `collapseFn(array(grIdx,:))` would not do what's intended if array(grIdx,:) is a 1-row matrix and the collapseFn were simply `sum`. E.g., this code would want `sum([1, 3, 5])` to return `[1 3 5]` BUT what will get returned is `9`. – Matthew Gunn Nov 10 '15 at 21:39
  • As a solution to the group label vector issue, you can take the 3rd output from `unique` called row-wise to get a vector. – David Kelley Nov 12 '15 at 14:57

Well, I have a solution that is almost as quick as the MEX one, but using only MATLAB. The logic is the same as most of the above: create a dummy 2D matrix, but instead of using @eq, I initialize a logical array from the start.

Elapsed time for mine is 0.172975 seconds. Elapsed time for Divakar's is 0.289122 seconds.

function aggArray = aggregate(array, group, collapseFn)
    m = size(array, 1);
    n = max(group);
    D = false(m,n); 
    row = (1:m)';
    idx = m*(group(:) - 1) + row;
    D(idx) = true;
    aggArray = zeros(n,size(array,2));
    for ii = 1:n
        aggArray(ii,:) = collapseFn(array(D(:,ii),:),1);
    end
end
eyalsoreq