10

While trying to decide which indexing method to recommend, I measured the performance of both. However, the measurements confused me a lot. I ran this multiple times and in different orders, but the results remained consistent. Here is how I measured the performance:

for N = [10000 15000 100000 150000]
    x = round(rand(N,1)*5)-2;   % test vector with integer values in -2..3
    idx1 = x~=0;                % compute both indices once before timing
    idx2 = abs(x)>0;

    tic
    for t = 1:5000              % method 1: direct inequality with zero
        idx1 = x~=0;
    end
    toc

    tic
    for t = 1:5000              % method 2: comparison of the absolute value
        idx2 = abs(x)>0;
    end
    toc
end

And this is the result:

Elapsed time is 0.203504 seconds.
Elapsed time is 0.230439 seconds.

Elapsed time is 0.319840 seconds.
Elapsed time is 0.352562 seconds.

Elapsed time is 2.118108 seconds. % This is the strange part
Elapsed time is 0.434818 seconds.

Elapsed time is 0.508882 seconds.
Elapsed time is 0.550144 seconds.

I checked, and this also happens for other sizes around 100000; even at 50000 the strange measurements occur.

So my question is: Does anyone else experience this for a certain range, and what causes this? (Is it a bug?)
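
To be clear, both methods give the same logical index for this data (they would differ if x contained NaN), and they are used like this:

x = round(rand(10,1)*5)-2;    % small example vector with values in -2..3
idx1 = x~=0;                  % logical index of the nonzero elements
idx2 = abs(x)>0;              % same result for real, non-NaN data like this
isequal(idx1, idx2)           % returns true here
nonzero_values = x(idx1);     % typical use: keep only the nonzero values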

Dennis Jaheruddin
  • Well I would definitely assume `abs(x)>0` would be slower because it is really doing two operations, but the N = 100000 trial does not follow this. Strange. I would, however, almost always use `x~=0` because it is only doing one operation. Also note that the difference between the two was not as high for me as it was for you; the third trial was only separated by 0.4 seconds, not 1.5. – MZimmerman6 Jul 23 '13 at 12:01
  • My only thought would be that there is some weird memory allocation going on in the background that the 100k trial throws off. – MZimmerman6 Jul 23 '13 at 12:04
  • I see the same thing, but not as drastically (R2012b, OS X 10.8.4). I wouldn't "assume" that `abs(x)>0` does two operations. Once JIT compiled, the sign bit can be ignored in the comparison. It's actually the `x~=0` case that's more complicated (equivalent to `x>0|x<0`). One possible reason for the difference between sizes might be [cache missing](https://en.wikipedia.org/wiki/CPU_cache#Cache_miss), which is discussed in detail [here](http://stackoverflow.com/questions/8547778/why-is-one-loop-so-much-slower-than-two-loops). – horchler Jul 27 '13 at 21:02

2 Answers

7

I think this is something to do with the JIT (results below are using 2011b). Depending on the system, the MATLAB version, the size of the variables, and exactly what is in the loop(s), it is not always faster to use the JIT. This is related to the "warm-up" effect, where sometimes, if you run an m-file more than once in a session, it gets quicker after the first run, because the accelerator only has to compile some parts of the code once.
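
For reference, the two sets of timings below were obtained by toggling the (undocumented) accelerator switch and re-running the benchmark from the question, roughly like this:

feature('accel','on')     % JIT/accelerator enabled (the default)
% ... run the timing loops from the question ...
feature('accel','off')    % JIT/accelerator disabled
% ... run the timing loops again ...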

JIT on (feature accel on)

Elapsed time is 0.176765 seconds.
Elapsed time is 0.185301 seconds.

Elapsed time is 0.252631 seconds.
Elapsed time is 0.284415 seconds.

Elapsed time is 1.782446 seconds.
Elapsed time is 0.693508 seconds.

Elapsed time is 0.855005 seconds.
Elapsed time is 1.004955 seconds.

JIT off (feature accel off)

Elapsed time is 0.143924 seconds.
Elapsed time is 0.184360 seconds.

Elapsed time is 0.206405 seconds.
Elapsed time is 0.306424 seconds.

Elapsed time is 1.416654 seconds.
Elapsed time is 2.718846 seconds.

Elapsed time is 2.110420 seconds.
Elapsed time is 4.027782 seconds.

Edited to add: it's kind of interesting to see what happens if you use integers instead of doubles:
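
(The conversion itself isn't shown above; presumably it amounts to something like this:)

x = int8(round(rand(N,1)*5)-2);   % same values, -2..3, but stored as int8 instead of double
idx1 = x~=0;
idx2 = abs(x)>0;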

JIT on, same code but with x converted to int8

Elapsed time is 0.202201 seconds.
Elapsed time is 0.192103 seconds.

Elapsed time is 0.294974 seconds.
Elapsed time is 0.296191 seconds.

Elapsed time is 2.001245 seconds.
Elapsed time is 2.038713 seconds.

Elapsed time is 0.870500 seconds.
Elapsed time is 0.898301 seconds.

JIT off, using int8

Elapsed time is 0.198611 seconds.
Elapsed time is 0.187589 seconds.

Elapsed time is 0.282775 seconds.
Elapsed time is 0.282938 seconds.

Elapsed time is 1.837561 seconds.
Elapsed time is 1.846766 seconds.

Elapsed time is 2.746034 seconds.
Elapsed time is 2.760067 seconds.
nkjt
  • Interesting to see that switching the JIT off actually makes it faster. However, if warm-up were the problem, it would not explain why 100000 is slower than 150000. Note that it also happens if you change the order to `N = [10000 15000 150000 100000]`. – Dennis Jaheruddin Jul 23 '13 at 13:36
  • I don't think the warm-up works on the command line. I'm basically guessing that there is some overhead to using the JIT which may be a function of variable size, and some benefit which is also a function of variable size. Eventually the benefit is greater than the overhead, but not until sizes of x > 150000 or so (in this case). – nkjt Jul 23 '13 at 14:42
  • It may not come from the JIT actually - please see my answer. – marsei Jul 28 '13 at 01:38
6

This may be due to some automatic optimization MATLAB uses for its Basic Linear Algebra Subprograms (BLAS).

Just like yours, my configuration (OS X 10.8.4, R2012a with default settings) takes longer to compute `idx1 = x~=0` for x with 1e5 elements than for x with 1.1e5 elements. See the left panel of the figure, where the processing time (y-axis) is measured for different vector sizes (x-axis). You will see a lower processing time for N > 103000. In this panel, I also displayed the number of cores that were active during the calculation. You will see that there is no drop for the one-core configuration, which means that MATLAB does not optimize the execution of `~=` when only one core is active (no parallelization possible). MATLAB enables some optimization routines when two conditions are met: multiple cores and a vector of sufficient size.

The right panel displays the results when `feature('accel','off')` is used (doc). Here, only one core is active (the 1-core and 4-core results are identical) and therefore no optimization is possible.

Finally, the function I used for activating/deactivating the cores is `maxNumCompThreads`. According to Loren Shure, `maxNumCompThreads` controls both the JIT and the BLAS. Since `feature('JIT','on'/'off')` did not play a role in the performance, the BLAS is the last option remaining.
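
In practice, restricting and restoring the thread count looks roughly like this (`maxNumCompThreads` returns the previous setting):

nPrev = maxNumCompThreads(1);   % force single-threaded computation, remembering the old setting
% ... timed code now runs on one core ...
maxNumCompThreads(nPrev);       % restore the previous number of threads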

I will leave the final sentence to Loren: "The main message here is that you should not generally need to use this function [maxNumCompThreads] at all! Why? Because we'd like to make MATLAB do the best job possible for you."

[Figure: processing time (y-axis) vs. vector size (x-axis) for 1-4 cores, with accel on (left panel) and accel off (right panel)]

accel = {'on';'off'};
figure('Color','w');
N = 100000:1000:105000;

for ind_accel = 2:-1:1
    feature('accel', accel{ind_accel});      % toggle the accelerator
    tElapsed = zeros(4,length(N));
    for ind_core = 1:4
        maxNumCompThreads(ind_core);         % limit the number of computational threads
        n_core = maxNumCompThreads;
        for ii = 1:length(N)
            fprintf('core asked: %d (true: %d) - N: %d\n', ind_core, n_core, N(ii));
            x = round(rand(N(ii),1)*5)-2;
            idx1 = x~=0;                     % warm-up run
            tStart = tic;
            for t = 1:5000
                idx1 = x~=0;
            end
            tElapsed(ind_core,ii) = toc(tStart);
        end
    end
    subplot(1,2,ind_accel);
    plot(N, tElapsed,'-o','MarkerSize',10);
    legend({'1','2','3','4'});               % one curve per core count
    xlabel('Vector size','FontSize',14);
    ylabel('Processing time','FontSize',14);
    set(gca,'FontSize',14,'YLim',[0.2 0.7]);
    title(['accel ' accel{ind_accel}]);
end
marsei
  • I guess that ideally optimization would kick in as soon as it leads to a better result. So, if I understand correctly: for this operation (on our computers) optimization simply kicks in too late? – Dennis Jaheruddin Jul 29 '13 at 08:14
  • yep - you have found one of the "best job for you" settings/limits. But surely MATLAB's devs are aware of this and did that intentionally. – marsei Jul 29 '13 at 08:29