
In my program I need to calculate the sum

    S(z) = sum_{k=0}^{K-1} C_k * tan(k*z)

I calculate this sum 2500 times with new values of C and z.

Argument z may be a vector. I wrote a straightforward for-loop version and a vectorized version as follows:

K = 200;
n_z = 40000;
C = ones(K,1); % an example, in real life they are some coefficients (at next call will be new)
k = 0:K-1;
z = linspace(0, 2*pi, n_z); % at next call will be new

tic;
    my_sum_for = zeros(1, n_z);
    for i=1:n_z
       my_sum_for(i) = C' * tan(k' * z(i));
    end
toc; % Elapsed time is 1.820485 seconds.

tic;
     my_sum = C' * tan(k' * z);
toc; % Elapsed time is 0.160924 seconds.

The vectorized version is faster, but not fast enough. Is it possible to improve the vectorized version?

After Dominique Jacquel's answer I have this vectorized version, which is faster:

    K = 200;
    n_z = 40000;
    C = ones(K,1)';  % an example, in real life they are some coefficients (at next call will be new)
    k = (0:K-1)';
    z = linspace(0, 2*pi, n_z);  % at next call will be new

    tic;
        my_sum_for = zeros(1, n_z);
        for i=1:n_z
           my_sum_for(i) = C * tan(k * z(i));
        end
    toc; % Elapsed time is 1.521587 seconds.

    tic;
         my_sum = C * tan(k * z);
    toc; % Elapsed time is 0.125468 seconds.

Is it possible to improve the vectorized version even more (bsxfun, arrayfun or something else)? The total time of 250 seconds (2500 calls) is still too slow for me (it is 75% of all computation time).

N0rbert
  • `all(my_sum_for == my_sum) -> ans = 0` ... so, you haven't checked this? Seems to me you're doing some odd transposing that's not needed... – Rody Oldenhuis Sep 06 '12 at 11:24
  • I checked this by calculating norm(my_sum_for - my_sum) = 1.7861e-10. It is not new for me that vectorized and looped versions of the same code produce slightly different results. – N0rbert Sep 06 '12 at 11:36
  • Are the actual contents of `C` really just ones? And there are some repeated elements in the matrix `k*z` that you might be able to skip computing the `tan` of, but given the values of `K` and `n_z` there's not much to save there. – Rody Oldenhuis Sep 06 '12 at 14:33
  • It is an example; C is an array of some coefficients. Memory consumption of the latest version is the smallest. Maybe there is a faster solution with more memory consumption? My program calls this code 2500 times, so it is 250 s. I hope that we can make the code faster. – N0rbert Sep 06 '12 at 15:18
  • What changes between each of your 2500 calls to this code? Is this code always called with new values for `C` and `z`? If not, possibly you can make use of that to optimize further. – grungetta Sep 06 '12 at 17:09
  • Just curious, how much of your execution time does this function represent? – zeFrenchy Sep 06 '12 at 17:55
  • At each of my 2500 calls `C` and `z` values change. This function takes 75% of all computation time. – N0rbert Sep 06 '12 at 18:32
  • I'd be surprised if you manage to squeeze much more out of that in pure single-threaded MATLAB. After that, go to a C implementation, which might gain you a little more. Note that this code can easily be run in parallel, in which case you could gain a lot running it on multicore machines. – zeFrenchy Sep 06 '12 at 20:05

3 Answers

6

I think you're pretty close to the hardware limits here. Matrix multiplications in Matlab are done with the BLAS libraries, which have proven tough to beat when it comes to performance.

AFAIK, the tangent function has actual dedicated hardware to compute its value. Also, Matlab automatically distributes trig functions of large matrices over multiple cores, so there's basically little to improve there.
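
If you want to see what that implicit multithreading buys you, a quick diagnostic is to pin MATLAB to one computational thread and repeat the timing, using C, k and z as defined in the question's second listing (maxNumCompThreads has long been documented as potentially deprecated, so treat this purely as a diagnostic, not production code):

nt = maxNumCompThreads;              % remember the current thread count
maxNumCompThreads(1);                % force single-threaded computation
tic; my_sum_1t = C * tan(k * z); toc;
maxNumCompThreads(nt);               % restore the previous setting
tic; my_sum_mt = C * tan(k * z); toc;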

Also, correct me if I'm wrong, but given the probable data overheads and memory issues, I think this computation would actually be slower on GPUs.

Now if you compare these:

tic;
for ii = 1:10
    my_sum = C * tan(k * z);
end
toc 

tic;
for ii = 1:10
    my_sum_notan = C * k * z;
end
toc

You'll see that all of the pain comes from the tangent function, so you had best focus on this. As you can read here, speeding trig functions up is basically only possible if you sacrifice some accuracy.

In all you should ask yourself these questions:

  1. Are you willing to give up full double precision, or are, say, 6 digits "close enough"? (See the single-precision sketch after this list.)

  2. Can't you re-formulate the problem so that the tangent is computed afterwards, or before, or in any case on a significantly smaller number of elements? In the problem as stated this is obviously not possible, but I don't know the full code -- there might be some nice trig identities that apply to your problem.

  3. Given all of the above, does the amount of effort needed to optimize this further really outweigh the longer runtime? 250 seconds does not sound too bad compared to writing custom, hardly portable MEX functions that implement crippled-but-fast trig functions.
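
Regarding point 1, here is a minimal sketch of what the single-precision variant would look like, assuming roughly 7 significant digits are acceptable (the variable names are mine; any speedup is machine-dependent, so benchmark and check the accuracy yourself):

% convert the inputs once; tan() and the final matrix-vector
% product then both run in single precision
Cs = single(C);                      % 1-by-K row of coefficients, as in the question
ks = single(k);                      % K-by-1 column 0..K-1
zs = single(z);                      % 1-by-n_z row of evaluation points

tic;
my_sum_single = Cs * tan(ks * zs);
toc;

% rough accuracy check against the double-precision result my_sum:
rel_err = max(abs(double(my_sum_single) - my_sum)) / max(abs(my_sum));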

Rody Oldenhuis
  • Good way to show the culprit here (tan)! You get one up vote from me :) Are you positive MATLAB will automatically multithread trig operations? Even without a parallel toolkit installed? – zeFrenchy Sep 06 '12 at 21:16
  • @DominiqueJacquel It was a new feature in R2007a (see [here](http://www.mathworks.nl/support/solutions/en/data/1-4PG4AN/?solution=1-4PG4AN) for example). You can see this in task manager/top if you do the calculation in an infinite while loop; you'll see that MATLAB is using most/all cores. – Rody Oldenhuis Sep 07 '12 at 02:52
  • @Rody Oldenhuis Thank you for your complete answer. Sometimes I think about calculating tasks in pure Fortran/C or using GotoBLAS/OpenBLAS, they are faster (in execution time) but development of such code is slower. – N0rbert Sep 07 '12 at 09:32
  • @Rody, I withdraw my statement ... redid the test *after* coffee, and it does use multiple cores if you apply a trig function to matrices. Good news :) ... deleted idiotic comment! cheers. – zeFrenchy Sep 10 '12 at 09:20
3

Do as many matrix operations as possible upfront (transposition in this case) to save some time in the loop:

K = 200;
n_z = 40000;
C = ones(K,1)';
k = (0:K-1)';
z = linspace(0, 2*pi, n_z);

tic;
    my_sum_for = zeros(1, n_z);
    for i=1:n_z
       my_sum_for(i) = C * tan(k * z(i));
    end
toc

tic;
     my_sum = C * tan(k * z);
toc;

My execution times before:

Elapsed time is 1.266158 seconds.
Elapsed time is 0.531173 seconds.

and after:

Elapsed time is 0.496803 seconds.
Elapsed time is 0.185396 seconds.
zeFrenchy
3

Here is a version that runs slightly faster on my computer:

K = 200;
n_z = 40000;
k = repmat((0:K-1)', 1, n_z);              % K-by-n_z
z = repmat(linspace(0, 2*pi, n_z), K, 1);  % K-by-n_z
C = ones(1, K);
tic
my_sum = C*tan(k.*z);
toc

Essentially, instead of an outer product of k and z I operate directly on matrices.

First version

Elapsed time is 0.652923 seconds.
Elapsed time is 0.240300 seconds.

After Dominique Jacquel's answer

Elapsed time is 0.376218 seconds.
Elapsed time is 0.214047 seconds.

My version

Elapsed time is 0.168535 seconds.

You may have to add the cost of the repmat calls, but maybe you can do that only once; I don't know the rest of the code.
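
If the memory for the two repmat copies turns out to matter, a bsxfun variant builds the K-by-n_z argument of tan without materialising the expanded k and z first. For this particular operation it computes the same matrix as the plain outer product k * z from the question, so I would not expect a dramatic difference; treat it as a sketch to benchmark, not a measured result:

K   = 200;
n_z = 40000;
C   = ones(1, K);               % example coefficients, as in the question
k   = (0:K-1)';                 % K-by-1
z   = linspace(0, 2*pi, n_z);   % 1-by-n_z

tic;
my_sum_bsx = C * tan(bsxfun(@times, k, z));   % bsxfun expands k and z implicitly
toc;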

I fully agree with Rody Oldenhuis: the majority of the work lies in the tangent function. I can tell you more. The k.*z computation is very efficient and cannot be improved much. If you calculate the memory bandwidth, it reaches around 10 GB/s on my computer; the peak I can get is around 16 GB/s, so it's close. Not many possibilities there. The same goes for C*T (the final matrix-vector product): that is a simple BLAS2 matrix-vector multiplication, which is memory-bound. For the system sizes you are showing, the MATLAB overhead is not too big.

Edit: as Rody mentioned, new versions of MATLAB already parallelize tan(), so there is not much to gain here either.

You can only hope to improve the tan() - possibly by running it in parallel. After all, it is a trivially parallelizable task... Consider exporting just this to a MEX file, which will use OpenMP. Very simple work, lots of speedup if you have a few spare cores.
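
For reference, here is roughly what the coarse-grained parallel version looks like at the MATLAB level, assuming the Parallel Computing Toolbox is available and using the C-row / k-column / z-row layout from the question's second listing (a MEX + OpenMP version would have the same chunking structure; whether either beats the implicitly multithreaded tan depends on your machine):

n_chunks = 8;                                   % e.g. one chunk per core
edges    = round(linspace(0, n_z, n_chunks + 1));
parts    = cell(1, n_chunks);

parfor c = 1:n_chunks
    idx      = edges(c)+1 : edges(c+1);         % this worker's slice of z
    parts{c} = C * tan(k * z(idx));             % same vectorized kernel per chunk
end

my_sum_par = [parts{:}];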

angainor
  • Thank you. What is interesting is that new versions of MATLAB (tested on 2008b) do not support the `mcc -x` flag: `Error: -x is no longer supported. The MATLAB Compiler no longer generates MEX files because there is no longer any performance advantage to doing so: the MATLAB JIT accelerates M-files by default. To hide proprietary algorithms, use the PCODE function.` Maybe I'll try to call an external self-made __tan__ function. – N0rbert Sep 07 '12 at 09:36
  • I meant that you should write your own MEX function and compile it using mex. This way you could skip the entire repmat / k * z business, which only eats up memory bandwidth and affects execution time. k*z is a BLAS2 operation; you do not need to explicitly create the matrix, you can compute its entries on the fly. – angainor Sep 07 '12 at 09:45