
I have a data matrix and a label vector:

data matrix: A = [1 2 2 1 2 6; 2 3 2 3 3 5]
label vector: B = [1 2 1 2 3 NaN]

I want to take the mean of all columns that have the same label and output these as a matrix sorted by label number, ignoring NaNs. So, in this example I would want:

labelmean(A,B) = [1.5 1.5 2; 2 3 3]

This can be done with a for loop like this:

function out = labelmean(data, label)
out = [];
for i = unique(label)
    if isnan(i); continue; end
    out = [out, mean(data(:, label == i), 2)];   % grows out on every iteration
end

However, I'm dealing with huge arrays containing many datapoints and labels. Additionally, this code snippet will be executed often. I'm wondering if there is a more efficient way to do this without looping over every individual label.

Poelie

3 Answers


This would be a good case for using accumarray. Think of accumarray as a miniature MapReduce paradigm. There are keys and values, and the job of accumarray is to group together all of the values that share the same key so that you can do something with them. In your case, the keys are the elements in B, and the values are the column locations that correspond to each entry in B: for each value in B, its position tells you which column of A carries that label. Therefore, we simply need to grab all of the column locations that map to the same label, access those columns of A, and take the mean across them. We need to be careful to ignore labels that are NaN, so we filter those out before calling accumarray.

The "something" that you do inside accumarray traditionally has to output a single number, but here we are outputting a column vector for each label. The trick is to wrap each output in a cell array, then use cat combined with comma-separated-list expansion to convert the result into a matrix.
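
As a small toy illustration of that last trick (not part of the solution itself), expanding a cell array of column vectors into a matrix with a comma-separated list looks like this:

% Hypothetical toy data: one column vector per label
C = {[1; 2], [3; 4], [5; 6]};
M = cat(2, C{:})   % C{:} expands into three separate inputs, so M = [1 3 5; 2 4 6]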

As such, something like this should work:

% Sample data
A = [1 2 2 1 2 6; 2 3 2 3 3 5];
B = [1 2 1 2 3 NaN];

% Find non-NaN locations
mask = ~isnan(B);

% Generate column locations that are not NaN, as well as the corresponding labels
ind = 1 : numel(B);
Bf = B(mask).';
ind = ind(mask).';

% Find label-wise means
C = accumarray(Bf, ind, [], @(x) {mean(A(:,x), 2)});

% Convert to numeric matrix
out = cat(2, C{:});

If you don't like the use of a temporary variable for finding those non-NaN values, we can do this with fewer lines of code, but you will still need the vector of column indices to determine where we need to sample from:

% Sample data
A = [1 2 2 1 2 6; 2 3 2 3 3 5];
B = [1 2 1 2 3 NaN];

% Solution
ind = 1 : numel(B);
C = accumarray(B(~isnan(B)).', ind(~isnan(B)).', [], @(x) {mean(A(:,x), 2)});
out = cat(2, C{:});

With your data, we get:

>> out

out =

    1.5000    1.5000    2.0000
    2.0000    3.0000    3.0000
rayryeng
  • Well done! I immediately rejected `accumarray` because the data is a matrix, not a vector, but of course it could be done! – Luis Mendo Mar 01 '17 at 20:12
  • @LuisMendo I know :D I really wanted to use `accumarray`.... I didn't want to give up. I figured a workaround would be to use the column indices as the input instead, then wrap accessing the matrix in an anonymous function.... but thank you :) – rayryeng Mar 01 '17 at 20:13

Here's one approach:

  1. Get the indices of the labels that are not NaN.
  2. Create a sparse matrix of zeros and ones such that multiplying A by it sums the columns of A that share each label.
  3. Divide that matrix by the sum of each of its columns, so that the sums become averages.
  4. Apply matrix multiplication to get the result, and convert it to a full matrix.

Code:

I = find(~isnan(B));                                 % step 1
t = sparse(I, B(I), 1, size(A,2), max(B(I)));        % step 2
t = bsxfun(@rdivide, t, sum(t,1));                   % step 3
result = full(A*t);                                  % step 4
Luis Mendo
  • Gotta love matrix multiplication :D. It works with the one-hot encoding matrix you have there. – rayryeng Mar 01 '17 at 20:11
  • The disadvantage of this method is that `t` may be really large, which may slow down the program much more than the for loop. – m7913d Mar 01 '17 at 20:22
  • @m7913d Good point. Solved now, by keeping the matrix sparse until the end – Luis Mendo Mar 01 '17 at 20:25
  • Under Matlab2016b (and presumably future versions) this could be further optimized by replacing step 3 with t=t./sum(t,1) – Poelie Mar 01 '17 at 21:16
  • @Poelie Yes, but I'm not sure that's an optimization (i.e. faster). I think it does the same, only with cleaner syntax – Luis Mendo Mar 01 '17 at 21:20
  • @LuisMendo Huh, I remembered reading that implicit expansion was faster than bsxfun, but I can't seem to find any evidence for it now. I stand corrected. – Poelie Mar 01 '17 at 21:26
  • @Poelie Oh, that's interesting. Please let me know if you find it. That would be a good reason to avoid `bsxfun` in favour of implicit expansion. For now I prefer `bsxfun` for backwards compatibility (and for nostalgic reasons...) – Luis Mendo Mar 01 '17 at 21:28
  • @Poelie You are correct. Broadcasting is at least as fast as, or faster than, `bsxfun` in R2016b. https://blogs.mathworks.com/loren/2016/10/24/matlab-arithmetic-expands-in-r2016b/ - Comment #6 made by Steve Eddins from MathWorks. Basically, for smaller matrices implicit expansion is faster; for larger matrices performance is near equal. – rayryeng Mar 01 '17 at 21:32
  • Going to go ahead and accept this as the answer, with credit to m7913d for doing the benchmarking. – Poelie Mar 01 '17 at 21:34
  • @LuisMendo Added an edit to prevent this code from providing a wrong answer when the NaNs are not at the end of the label vector. – Poelie Mar 02 '17 at 15:19
  • @Poelie Thanks, good catch! I've made a small additional modification, since `B2` was now redundant – Luis Mendo Mar 02 '17 at 15:25
  • @Poelie You may be interested in [this](http://stackoverflow.com/questions/42559922/how-much-faster-is-implicit-expansion-compared-with-bsxfun/42559923#42559923). Thanks again for the notice! – Luis Mendo Mar 02 '17 at 15:50
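
For reference, here is a minimal sketch of the implicit-expansion variant that Poelie and Luis Mendo discuss in the comments above. It assumes MATLAB R2016b or newer (where, as the comments suggest, implicit expansion can replace `bsxfun` here) and should behave the same as the `bsxfun` version:

% Same steps as in the answer, with step 3 rewritten using implicit expansion
I = find(~isnan(B));                                 % step 1
t = sparse(I, B(I), 1, size(A,2), max(B(I)));        % step 2
t = t ./ sum(t,1);                                   % step 3 (implicit expansion instead of bsxfun)
result = full(A*t);                                  % step 4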

This answer is not a new method, but a benchmark of the given answers, because if you talk about performance, you always have to benchmark it.

clear all;
% I tried to make a real-life dataset (the original author may provide a
% better one)
A = [1:3e4; 1:10:3e5; 1:100:3e6]; % large dataset
B = repmat(1:1e3, 1, 3e1); % large number of labels

labelmean(A,B);
labelmeanLuisMendoA(A,B);
labelmeanLuisMendoB(A,B);
labelmeanRayryeng(A,B);

function out = labelmean(data,label)
    tic
    out=[];
    for i=unique(label)
        if isnan(i); continue; end
        out = [out, mean(data(:,label==i),2)];
    end
    toc
end

function out = labelmeanLuisMendoA(A,B)
    tic
    B2 = B(~isnan(B)); % remove NaN's
    t = full(sparse(1:numel(B2),B2,1,size(A,2),max(B2))); % template matrix
    out = A*t; % sum of columns that share a label
    out = bsxfun(@rdivide, out, sum(t,1)); % convert sum into mean
    toc
end

function out = labelmeanLuisMendoB(A,B)
    tic
    B2 = B(~isnan(B));                                   % step 1
    t = sparse(1:numel(B2), B2, 1, size(A,2), max(B2));  % step 2
    t = bsxfun(@rdivide, t, sum(t,1));                   % step 3
    out = full(A*t);                                  % step 4
    toc
end

function out = labelmeanRayryeng(A,B)
    tic
    ind = 1 : numel(B);
    C = accumarray(B(~isnan(B)).', ind(~isnan(B)).', [], @(x) {mean(A(:,x), 2)});
    out = cat(2, C{:});
    toc
end

The output is:

Elapsed time is 0.080415 seconds. % original
Elapsed time is 0.088427 seconds. % LuisMendo original answer
Elapsed time is 0.004223 seconds. % LuisMendo optimised version
Elapsed time is 0.037347 seconds. % rayryeng answer

For this dataset, Luis Mendo's optimised version is the clear winner, whereas his first version was slower than the original one.

=> Don't forget to benchmark your performance!

EDIT: Test platform specifications

  • Matlab R2016b
  • Ubuntu 64-bit
  • 15.6 GiB RAM
  • Intel® Core™ i7-5600U CPU @ 2.60GHz × 4
m7913d
  • It's nice to see that `accumarray` is faster than the previous version of Luis Mendo's answer. Obviously going to `sparse` is the winner. Minor note: I've edited your post to correct my callsign. I hope you don't mind. I'm also curious to know which MATLAB version you're using, as well as your CPU and RAM configuration. – rayryeng Mar 01 '17 at 21:00
  • Use `timeit`; single `tic` and `toc` calls are poor indicators of performance – sco1 Mar 01 '17 at 21:05
  • Results in Octave: 0.192002, 0.231313, 0.406085, 0.119208 seconds. It seems that rayryeng's method is the fastest. Try it on [rextester](http://rextester.com/FPDXV98643) – rahnema1 Mar 01 '17 at 21:09
  • @rahnema1 Interesting. Thank you for the additional benchmark :) – rayryeng Mar 01 '17 at 21:28
  • I reran this code, but replaced the tic/toc with timeit as suggested by excaza and found results similar to those presented by m7913d. – Poelie Mar 01 '17 at 21:34
  • @rayryeng I added my platform specifications – m7913d Mar 01 '17 at 21:54
  • Thanks a bunch. To add to @excaza 's comment, usually using `tic` and `toc` requires some warmup time before actually using it. Therefore, put `tic` and `toc` in a dummy loop for a several iterations before actually doing any benchmarking. See this example here: http://stackoverflow.com/questions/26137612/how-to-make-a-diagonal-divider-of-zeros-and-ones-using-matlab/26137901#26137901. It would also be fruitful to loop the function calls over a certain number of iterations and take the average time. Just calling `tic` and `toc` once is unstable. – rayryeng Mar 01 '17 at 21:56
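
Following up on the comments about `timeit`, here is a minimal sketch of how the comparison could be rerun with it. It assumes the `tic`/`toc` calls are first removed from the benchmarked functions and that those functions are saved where they can be called; the function names are the ones defined in the benchmark above:

% Same synthetic dataset as in the benchmark above
A = [1:3e4; 1:10:3e5; 1:100:3e6];
B = repmat(1:1e3, 1, 3e1);

% timeit calls each function several times and reports the median measurement,
% which avoids the warm-up problem of a single tic/toc pair
tLoop   = timeit(@() labelmean(A, B));
tSparse = timeit(@() labelmeanLuisMendoB(A, B));
tAccum  = timeit(@() labelmeanRayryeng(A, B));

fprintf('loop: %.4f s, sparse: %.4f s, accumarray: %.4f s\n', tLoop, tSparse, tAccum);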