0

The data set is in the following format: Input sample matrix X and output class vector Y such that each row in X is a sample and each of its column corresponds to a feature. Each index in Y corresponds to the respective output class for the corresponding sample in X. X can contain real numbers while Y contains positive integers.

My aim is to order the data set in terms of its class. For example

X =      Y =
 1 8 3     2
 4 2 6     1
 7 8 9     2
 2 3 4     3
 1 4 6     1

should be ordered and interleaved as

X =      Y =
 4 2 6     1
 1 8 3     2
 2 3 4     3
 1 4 6     1
 7 8 9     2

The code I've attempted seems to take a long time to run as it is based on serial execution. It is the following.

X = csvread('X.csv');
Y = csvread('Y.csv');

n = size(unique(Y),1);
m = size(X,1);

for i = 1:n
   Dataset(i).X = X(Y==i,:);
   Dataset(i).Y = Y(Y==i);
end

[num, ~] = hist(Y,n);
maxfreq = max(num);

NewX = [];
NewY = [];

for j = 1:maxfreq
   for i = 1:n
      if(j <= size(Dataset(i).X,1))
         NewX = [NewX; Dataset(i).X(j,:)];
         NewY = [NewY; i];
      end
   end
end

X = NewX;
Y = NewY;
clear NewX;
clear NewY;

csvwrite('OrderedX.csv', X);
csvwrite('OrderedY.csv', Y);

Is is possible to parallelize the above code?

Ébe Isaac
  • 11,563
  • 17
  • 64
  • 97
  • So there's some maximum integer Y_max? And is the number of 1s, 2s, 3s, etc... at most 1 apart? (eg. you couldn't have five 1s but only three 2s?) Or it looks like from your code, there could be way more of one number than the other. – Matthew Gunn Nov 13 '15 at 08:49
  • Yes @MatthewGunn, there can be more of one number than the other. Eg., a random set of five 1s, three 2s and six 3s can be arranged as `[1 2 3 1 2 3 1 2 3 1 3 1 3 3]'`. – Ébe Isaac Nov 13 '15 at 09:00

2 Answers2

1

Approach #1 Using cumsum and diff following the same philosophy as the one listed in this solution -

function [outX,outY] = interleave_cumsum_diff(X,Y)

Y = Y(:);
[R,C] = find(bsxfun(@eq,Y,unique(Y).'));

lens = accumarray(C,1);
out = ones(1,numel(R));
shifts = cumsum(lens(1:end-1));
out(shifts+1) =  1- diff([0 ; shifts]);
[~,idx] = sort(cumsum(out));
sort_idx = R(idx)';

outX = X(sort_idx,:);
outY = Y(sort_idx,:);

Approach #1 Using bsxfun -

function [outX,outY] = interleave_bsxfuns(X,Y)

Y = Y(:);
[R,C] = find(bsxfun(@eq,Y,unique(Y).'));
lens = accumarray(C,1);

mask = bsxfun(@le,[1:max(lens)]',lens.');
V = zeros(size(mask));
V(mask) = R;
Vt = V.';
sort_idx = Vt(mask.');
outX = X(sort_idx,:);
outY = Y(sort_idx,:);

Sample run -

1) Inputs :

>> X
X =
     1     8     3
     4     2     6
     7     8     9
     2     3     4
     1     4     6
>> Y
Y =
     2
     1
     2
     3
     1

2) Outputs from the two approaches :

>> [NewX,NewY] = interleave_cumsum_diff(X,Y)
NewX =
     4     2     6
     1     8     3
     2     3     4
     1     4     6
     7     8     9
NewY =
     1
     2
     3
     1
     2
>> [NewX,NewY] = interleave_bsxfuns(X,Y)
NewX =
     4     2     6
     1     8     3
     2     3     4
     1     4     6
     7     8     9
NewY =
     1
     2
     3
     1
     2
Community
  • 1
  • 1
Divakar
  • 218,885
  • 19
  • 262
  • 358
1

You're resizing matrices all the time which is expensive. A quick speedup for your algorithm would be to set NewX and NewY to the proper size and just copy data in:

NewX = zeros(size(X));
NewY = zeros(size(Y));
k = 1;
for j = 1:maxfreq
  for i = 1:n
     if(j <= size(Dataset(i).X,1))
       NewX(k,:) = Dataset(i).X(j,:);
       NewY(k) = i;
       k=k+1;
     end
   end
end
Matthew Gunn
  • 4,451
  • 1
  • 12
  • 30
  • Wow! That sure did the trick. The your code solved the problem in less than a splinter fraction of the previous time! Answer accepted! – Ébe Isaac Nov 13 '15 at 09:21