-1

I have to execute this for cycle:

load('Y');
X_test = ...;
Y_test = ...;
X_train = ...;
Y_train = ...;

    for i=1:length(Y.Y)
        if Y.Y(i,1) == l
            current_test_data = [current_test_data; X_test(i,:)];
            current_test_labes = [current_test_labes; Y_test(i,:)];
        else
            current_train_data = [current_train_data; X_train(i,:)];
            current_train_labes = [current_train_labes; Y_train(i,:)];            
        end
    end

But length(Y.Y) is 2300250 so this execution takes a long time. There is a faster way to do that?

Sara Savio
  • 49
  • 1
  • 7
  • 4
    You are appending wrong. If you need to append, do `x(end+1)=0`, rather than `x=[x,0]`. See [here](https://stackoverflow.com/a/48990695/7328782). But even better is to pre-allocate. You can gain immensely by creating an array large enough to hold the largest possible output, then trimming it down at the end of the loop. Incrementing arrays in a loop like you do here is **very** expensive. – Cris Luengo Dec 07 '18 at 06:57
  • Preallocate for example what in my code? You refer to "current_test_data" and so on? – Sara Savio Dec 07 '18 at 07:37
  • Yes, make an array large enough to hold the data you'll be saving in it, so you don't need to append data to the array. See: https://www.mathworks.com/help/matlab/matlab_prog/preallocating-arrays.html – Cris Luengo Dec 07 '18 at 07:38
  • Could this speed up a considerable amount? – Sara Savio Dec 07 '18 at 07:40
  • 1
    yes. try it out. – shamalaia Dec 07 '18 at 08:09
  • 3
    Can't you just use logical indexing here? Get rid of the loops and use `idx = (Y.Y(:,1) == l); current_test_data = X_test(idx, :); current_test_labes = Y_test(idx,:);` etc? If this doesn't work then please provide a [mcve] which we can test. – Wolfie Dec 07 '18 at 08:40

1 Answers1

2

What you are doing is indeed not great in terms of performance.

The first issue is the loop. Matlab does not handle them very fast; when possible, vectorized operations should be preferred as they are well optimized. For instance, it is much faster to execute A=B.*C than for ii=1:length(B), A(ii)=B(ii)*C(ii);end

The second issue is that you are concatenating arrays within the loop. current_test_data starts as a small array that grows over time. Each time some data is appended, memory needs to be reallocated. The data may have to be moved to another place in the memory. Since Matlab stores data in column major order, adding an extra row also means that all samples but the first column have to be moved (while adding an extra column is just appending data at the end). All of this conspires to make a terrible performance. With small arrays, this may not be noticeable; but as you start moving megabytes or more of data in memory at each iteration, performance will plummet

The usual solution, when the final size of the array is known, is to pre-allocate arrays, for instance current_test_data = zeros(expected_rows,expected_columns);, and put the data straight away where it belongs: current_test_data(jj,:) = some_matrix(ii,:);. No more memory allocation, no more memory moves, no more shuffling-samples-around.

But then, in your specific case, the solution lies with the first point: using vectorized notation is the solution. It will pre-allocate the arrays to the right size and copy data efficiently.

sel = Y.Y(:,1)==1; % Builds a logical vector
% Selects data based on logical vector
current_test_data =  X_test(sel,:);
current_test_labes = Y_test(sel,:);
current_train_data = X_train(~sel,:);
current_train_labes = Y_train(~sel,:);  
Brice
  • 1,560
  • 5
  • 10