
In MATLAB, I am trying to vectorise my code to improve simulation time. However, the result was that the overall efficiency deteriorated.

To understand this phenomenon I created three distinct functions that do the same thing but with different approaches:

The main file :

clc
clear

n = 10000;
Value = cumsum(ones(1,n));

NbLoop = 10000;

time01 = zeros(1,NbLoop);
time02 = zeros(1,NbLoop);
time03 = zeros(1,NbLoop);

for test = 1 : NbLoop

    tic
    vector1 =  function01(n,Value);
    time01(test) = toc ;

    tic
    vector2 =  function02(n,Value);
    time02(test) = toc ;

    tic
    vector3 =  function03(n,Value);
    time03(test) = toc ; 

end

figure(1)
hold on

plot( time01, 'b')
plot( time02, 'g')
plot( time03, 'r')

Function 01 (for-loop):

function vector =  function01(n,Value)

vector = zeros( 2*n,1);
for k = 1:n
    vector(2*k -1) =  Value(k);
    vector(2*k) =  Value(k);
end

end

Function 02 (indexed assignment):

function vector =  function02(n,Value)

vector = zeros( 2*n,1);
vector(1:2:2*n) = Value; 
vector(2:2:2*n) = Value; 

end

Function 03 (transpose and reshape):

function vector =  function03(n,Value)

MatrixTmp = transpose([Value(:), Value(:)]);
vector = MatrixTmp(:);

end

The blue plot corresponds to the for-loop.

n = 100:

(Figure: timing plots of time01/time02/time03 for n = 100)

n = 10000:

(Figure: timing plots of time01/time02/time03 for n = 10000)

When I run the code with n = 100, the most efficient solution is the first function, with the for-loop. When n = 10000, the first function becomes the least efficient.

  • Do you have a way to know how and when to properly replace a for-loop with a vectorised counterpart?
  • What is the impact of index searching in arrays of very large dimensions?
  • Does MATLAB handle arrays of dimension 3 or higher differently from arrays of dimension 1 or 2?
  • Is there a clever way to replace a while-loop that uses the result of one iteration in the next?
  • Performance benchmarks are made more accurate by using `timeit` instead of `tic`/`toc`. As you've written your tests as functions, it should be easy to switch to using `timeit`. It would also be helpful if you edited your question to include the plots, so we don't have to run your code to see the results. – Wolfie Aug 30 '19 at 15:56
  • Hello, thank you for your answer, you are right, I added the figures. I didn't know this function, thank you for that. With timeit, the results with n = 100 is better for the for loop. However I also got this warning : *Warning: The measured time for F may be inaccurate because it is running too fast. Try measuring something that takes longer.* – Chewbaka Aug 30 '19 at 16:15
  • Both `timeit` and `tic`/`toc` have limited precision. Besides repeating your routine a few thousand times and measuring aggregate, use third party timers such as [this one](https://www.mathworks.com/matlabcentral/fileexchange/16534-high-accuracy-timer) if you need more precision. Aside, none of your functions seem like vectorization to me, which more or less involves using Matlab's native functions for array-wide tasks -- iterator not included, and `function03` doesn't use its first input. – Argyll Sep 01 '19 at 18:46
  • Please, could you give me names of some of those function : _Matlab's native functions for array-wide tasks_ ? is it like `logical` ? – Chewbaka Sep 02 '19 at 08:30
  • @Argyll: `timeit` has very high precision. It figures out how many times to run your code to get meaningful values, it pre-warms the environment, and it figures out what the overhead of the looping is to remove that from the time. Your suggested alternative does none of that. I don’t think it’s any good. – Cris Luengo Sep 02 '19 at 12:33
  • @CrisLuengo: The thing is, I don't know how `timeit` arrives at the precision in its output, which is often E-11 in seconds. I don't know if such precision is simply what `tic/toc` usually get divided by the number of repetitions or the proper statistical analysis is performed. Do you? In any case, [Houtzager's package](https://www.mathworks.com/matlabcentral/fileexchange/16534-high-accuracy-timer) outputs picoseconds. Taken at face value, his package seems to be able to get the highest precision possible from the cpu's clock time while `timeit` cannot. – Argyll Sep 02 '19 at 19:44
  • @CrisLuengo: Given the precision difference, for any decent size data set, you don't have to measure aggregate run time. You can measure run time of each pass and average. Pre-warming can be done manually and you can make sure your cpu is ramped up before taking measurement. So those 3 benefits of `timeit` don't really apply. – Argyll Sep 02 '19 at 19:46
  • @Argyll: Those benefits apply because people don’t know how to do those properly. Picosecond measurements are BS, if you don’t take into account the overhead of calling the timing functions etc. Also, repeated measures of something that takes 1 ms yields values with a very broad range. This is because there’s tons of other stuff happening on your machine at the same time. Averaging is not the right approach. `timeit` measures sufficient repetitions of the code to make the time difference meaningful, then repeats the measurement to avoid errors due to other background jobs. – Cris Luengo Sep 02 '19 at 19:57
  • @CrisLuengo: The cpu clock time should be in picoseconds. Yes access time errors are in tens of ns or more depending on how the time is accessed. I simply stated the output precision of Houtzager's package. Taken at face value, I am assuming the code gets the highest precision possible. I think you meant to include a link? – Argyll Sep 02 '19 at 20:03
  • @Argyll: if you want to read about `timeit` you can see [Steve’s writeup](https://blogs.mathworks.com/steve/2008/02/29/timing-code-in-matlab/) or just `edit timeit`. – Cris Luengo Sep 02 '19 at 20:16
  • @Cris Luengo: Ok. Thanks for the link. – Argyll Sep 06 '19 at 20:56

1 Answer


Using MATLAB Online I see something different:

n            10000       100
function01   5.6248e-05  2.2246e-06
function02   1.7748e-05  1.9491e-06
function03   2.7748e-05  1.2278e-06
function04   1.1056e-05  7.3390e-07  (my version, see below)

Thus, the loop version is always slowest. Method #2 is faster for very large matrices, Method #3 is faster for very small matrices.

The reason is that Method #3 makes two copies of the data (the transpose of a matrix incurs a copy), and that is bad when there is a lot of data. Method #2 uses indexing, which is expensive, but not as expensive as copying a lot of data twice.

I would suggest this function instead (Method #4), which transposes only vectors (which is essentially free). It is a simple modification of your Method #3:

function vector = function04(n,Value)
vector = [Value(:).'; Value(:).'];
vector = vector(:);
end

Do you have a way to know how and when to properly replace a for-loop with a vectorised counterpart?

In general, vectorized code is faster as long as it does not create large intermediate matrices. For small data you can vectorize more aggressively; for large data, loops are sometimes more efficient because of the reduced memory pressure. It depends on what the vectorization requires.
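As an illustration of that trade-off (a hypothetical example, not from the question): the vectorized expression below creates full-size intermediate arrays, while the loop uses only scalar temporaries but pays interpreter overhead on every element:

```matlab
% Hypothetical example: element-wise polynomial evaluation.
x = rand(1, 1e6);

% Vectorized: x.^2 and 2*x are each intermediate arrays the size of x,
% so peak memory is several copies of x -- fast until memory pressure bites.
y1 = x.^2 + 2*x + 1;

% Loop: only scalar temporaries, but interpreted overhead per element.
y2 = zeros(size(x));
for k = 1:numel(x)
    y2(k) = x(k)^2 + 2*x(k) + 1;
end
```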

What is the impact of index searching in arrays of very large dimensions?

This refers to operations such as d = data(data==0). Much like everything else, this is efficient for small data and less so for large data, because data==0 is an intermediate array of the same size as data.
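A minimal sketch of that pattern (with hypothetical data); note that the comparison itself already returns a logical array, so no `logical()` wrapper is needed:

```matlab
% Hypothetical data: a large array with some zero entries.
data = randi([0, 5], 1, 1e6);

mask = (data == 0);   % intermediate logical array, same size as data
d = data(mask);       % extract the matching elements

% Equivalent one-liner -- the intermediate array is still created:
d2 = data(data == 0);
```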

Does MATLAB handle arrays of dimension 3 or higher differently from arrays of dimension 1 or 2?

No, not in general. Functions such as `sum` are implemented in a dimensionality-independent way.

Is there a clever way to replace a while-loop that uses the result of one iteration in the next?

It depends very much on what the operations are. Functions such as cumsum can often be used to vectorize this type of code, but not always.
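For instance (a hypothetical sketch): a running total, where each iteration feeds the next, collapses into one `cumsum` call, and a first-order linear recurrence can similarly be handled with `filter`:

```matlab
x = rand(1, 1000);

% Loop version: iteration k depends on iteration k-1.
total = zeros(1, numel(x));
total(1) = x(1);
for k = 2:numel(x)
    total(k) = total(k-1) + x(k);
end

% Vectorized equivalent (matches total up to round-off):
total2 = cumsum(x);

% A linear recurrence y(k) = a*y(k-1) + x(k) can be written with filter:
a = 0.5;
y = filter(1, [1, -a], x);
```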


This is my timing code, I hope it shows how to properly use timeit:

%n = 10000;
n = 100;
Value = cumsum(ones(1,n));

vector1 = function01(n,Value);
vector2 = function02(n,Value);
vector3 = function03(n,Value);
vector4 = function04(n,Value);
assert(isequal(vector1,vector2))
assert(isequal(vector1,vector3))
assert(isequal(vector1,vector4))

timeit(@()function01(n,Value))
timeit(@()function02(n,Value))
timeit(@()function03(n,Value))
timeit(@()function04(n,Value))

function vector = function01(n,Value)
vector = zeros(2*n,1);
for k = 1:n
    vector(2*k-1) = Value(k);
    vector(2*k) = Value(k);
end
end

function vector = function02(n,Value)
vector = zeros(2*n,1);
vector(1:2:2*n) = Value; 
vector(2:2:2*n) = Value; 
end

function vector = function03(n,Value)
MatrixTmp = transpose([Value(:), Value(:)]);
vector = MatrixTmp(:);
end

function vector = function04(n,Value)
vector = [Value(:).'; Value(:).'];
vector = vector(:);
end
  • Thank you for the detailed answer, I think that indeed the problem comes from the creation of intermediate matrix. I usually tend to use reshape, repmat and transpose which deteriorate time as you said. By index searching I meant something like using `IDs = logical('Big Matrix == expression' )` then extract the data by creating a new element : `ExtractedData = BigMatrix(IDs)`. I therefore created numerous intermediate elements. – Chewbaka Sep 02 '19 at 08:05
  • @Chewbaka: you don’t need `logical` there, the result of `==` already is logical. I have updated the answer to address this question. – Cris Luengo Sep 02 '19 at 12:41
  • Oh, by the way, `reshape` is essentially free. `permute`, `transpose` and `repmat` require a data copy. – Cris Luengo Sep 02 '19 at 12:52