This answer does not propose a new method. It benchmarks the existing answers, because any claim about performance should be backed by a measurement.
clear all;
% An attempt at a realistic dataset (the original poster may be able to
% provide a better one)
A = [1:3e4; 1:10:3e5; 1:100:3e6]; % large dataset: 3 rows, 30000 columns
B = repmat(1:1e3, 1, 3e1);        % 1000 distinct labels, each used 30 times
labelmean(A,B);
labelmeanLuisMendoA(A,B);
labelmeanLuisMendoB(A,B);
labelmeanRayryeng(A,B);
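function out = labelmean(data,label)
    % Original approach: loop over the labels and grow the output
    tic
    out = [];
    for i = unique(label)
        if isnan(i); continue; end % skip NaN labels
        out = [out, mean(data(:,label==i),2)]; % mean of all columns with label i
    end
    toc
end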
function out = labelmeanLuisMendoA(A,B)
    tic
    B2 = B(~isnan(B)); % remove NaNs
    % template matrix; row k lines up with column k of A only if B has no
    % NaN, which is the case for this dataset
    t = full(sparse(1:numel(B2),B2,1,size(A,2),max(B2)));
    out = A*t; % sum of columns that share a label
    out = bsxfun(@rdivide, out, sum(t,1)); % convert sum into mean
    toc
end
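function out = labelmeanLuisMendoB(A,B)
    tic
    B2 = B(~isnan(B)); % step 1: remove NaNs
    t = sparse(1:numel(B2), B2, 1, size(A,2), max(B2)); % step 2: sparse template matrix
    t = bsxfun(@rdivide, t, sum(t,1)); % step 3: scale each column by its label count
    out = full(A*t); % step 4: one product yields the means directly
    toc
end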
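function out = labelmeanRayryeng(A,B)
    tic
    ind = 1 : numel(B); % column indices of A
    % group the column indices by label and average the matching columns
    C = accumarray(B(~isnan(B)).', ind(~isnan(B)).', [], @(x) {mean(A(:,x), 2)});
    out = cat(2, C{:}); % concatenate the per-label means into one matrix
    toc
end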
The output is:
Elapsed time is 0.080415 seconds. % original
Elapsed time is 0.088427 seconds. % LuisMendo original answer
Elapsed time is 0.004223 seconds. % LuisMendo optimised version
Elapsed time is 0.037347 seconds. % rayryeng answer
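The timings are only comparable if the variants return the same result. The following is a minimal sketch of such a check, not part of the original benchmark: it recomputes the loop-based means and the sparse-template means on the same dataset and reports the largest relative difference. The variable names `ref`, `cand` and `relErr` are illustrative, and the sketch assumes `B` contains no NaN, as in the dataset above.

```matlab
% Sketch: compare the loop-based result with the sparse-template result.
A = [1:3e4; 1:10:3e5; 1:100:3e6];   % same dataset as in the benchmark
B = repmat(1:1e3, 1, 3e1);          % same labels as in the benchmark

% Reference: plain loop over the labels (this B contains no NaN)
labels = unique(B);
ref = zeros(size(A,1), numel(labels));
for k = 1:numel(labels)
    ref(:,k) = mean(A(:, B == labels(k)), 2);
end

% Candidate: sparse template matrix, as in labelmeanLuisMendoB
t = sparse(1:numel(B), B, 1, size(A,2), max(B));
t = bsxfun(@rdivide, t, sum(t,1));
cand = full(A*t);

% Largest relative deviation between the two results
relErr = max(abs(ref(:) - cand(:)) ./ abs(ref(:)));
fprintf('max relative difference: %g\n', relErr);
```

A difference at rounding-error level would indicate that the two variants compute the same means, up to the order of the floating-point summation.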
For this dataset, LuisMendo's optimised version is the clear winner (roughly 19 times faster than the original loop), whereas his first version was slightly slower than the original.
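A single `tic`/`toc` measurement can be noisy, so the numbers above are best read as indicative. As a hedged sketch, MATLAB's built-in `timeit` could be used instead; it calls a zero-argument function handle several times and returns the median run time. The helper handles `template`, `normalise` and `sparseMean` below are hypothetical names that restate the sparse-template approach without `tic`/`toc`, and they assume `B` contains no NaN.

```matlab
% Sketch: time the sparse-template approach with timeit instead of tic/toc.
A = [1:3e4; 1:10:3e5; 1:100:3e6];   % same dataset as in the benchmark
B = repmat(1:1e3, 1, 3e1);          % same labels as in the benchmark

% Hypothetical helpers (assumption: B contains no NaN)
template   = @(B, n) sparse(1:numel(B), B, 1, n, max(B));   % one 1 per column of A
normalise  = @(t) bsxfun(@rdivide, t, sum(t,1));            % divide by label counts
sparseMean = @(A, B) full(A * normalise(template(B, size(A,2))));

% timeit expects a handle that takes no input arguments
tSparse = timeit(@() sparseMean(A, B));
fprintf('sparse template: %.6f s (median)\n', tSparse);
```

The same pattern applies to the other variants once their `tic`/`toc` calls are removed, so that all of them are measured in the same way.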
=> Don't forget to benchmark your performance!
EDIT: Test platform specifications
- MATLAB R2016b
- Ubuntu 64-bit
- 15.6 GiB RAM
- Intel® Core™ i7-5600U CPU @ 2.60GHz × 4