12

I wrote a piece of C code to illustrate a point in a discussion about optimizations and branch prediction. Then I noticed an even more diverse outcome than I had expected. My goal was to write it in the common subset of C++ and C, so that it is standard-compliant for both languages and fairly portable. It was tested on different Windows PCs:

#include <stdio.h>
#include <time.h>

/// @return - time difference between start and stop in milliseconds
int ms_elapsed( clock_t start, clock_t stop )
{
    return (int)( 1000.0 * ( stop - start ) / CLOCKS_PER_SEC );
}

int const Billion = 1000000000;
/// & with numbers up to Billion (all < 2^30, so bit 30 is clear) gives the repeating pattern 0, 0, 2, 2
int const Pattern_0_0_2_2 = 0x40000002; 

/// @return - half of Billion  
int unpredictableIfs()
{
    int sum = 0;
    for ( int i = 0; i < Billion; ++i )
    {
        // true, true, false, false ...
        if ( ( i & Pattern_0_0_2_2 ) == 0 )
        {
            ++sum;
        }
    }
    return sum;
}

/// @return - half of Billion  
int noIfs()
{
    int sum = 0;
    for ( int i = 0; i < Billion; ++i )
    {
        // 1, 1, 0, 0 ...
        sum += ( i & Pattern_0_0_2_2 ) == 0;
    }
    return sum;
}

int main()
{
    clock_t volatile start;
    clock_t volatile stop;
    int volatile sum;
    printf( "Puzzling measurements:\n" );

    start = clock();
    sum = unpredictableIfs();
    stop = clock();
    printf( "Unpredictable ifs took %d msec; answer was %d\n"
          , ms_elapsed(start, stop), sum );

    start = clock();
    sum = unpredictableIfs();
    stop = clock();
    printf( "Unpredictable ifs took %d msec; answer was %d\n"
          , ms_elapsed(start, stop), sum );

    start = clock();
    sum = noIfs();
    stop = clock();
    printf( "Same without ifs took %d msec; answer was %d\n"
          , ms_elapsed(start, stop), sum );

    start = clock();
    sum = unpredictableIfs();
    stop = clock();
    printf( "Unpredictable ifs took %d msec; answer was %d\n"
          , ms_elapsed(start, stop), sum );
}

Compiled with VS2010, /O2 optimizations; Intel Core 2, WinXP results:

Puzzling measurements:
Unpredictable ifs took 1344 msec; answer was 500000000
Unpredictable ifs took 1016 msec; answer was 500000000
Same without ifs took 1031 msec; answer was 500000000
Unpredictable ifs took 4797 msec; answer was 500000000

Edit: Full compiler switches:

/Zi /nologo /W3 /WX- /O2 /Oi /Oy- /GL /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Gm- /EHsc /GS /Gy /fp:precise /Zc:wchar_t /Zc:forScope /Fp"Release\Trying.pch" /Fa"Release\" /Fo"Release\" /Fd"Release\vc100.pdb" /Gd /analyze- /errorReport:queue

Another person posted these ... Compiled with MinGW g++ 4.7.1, -O1 optimizations; Intel Core 2, WinXP results:

Puzzling measurements:
Unpredictable ifs took 1656 msec; answer was 500000000
Unpredictable ifs took 0 msec; answer was 500000000
Same without ifs took 1969 msec; answer was 500000000
Unpredictable ifs took 0 msec; answer was 500000000

He also posted these results for -O3 optimizations:

Puzzling measurements:
Unpredictable ifs took 1890 msec; answer was 500000000
Unpredictable ifs took 2516 msec; answer was 500000000
Same without ifs took 1422 msec; answer was 500000000
Unpredictable ifs took 2516 msec; answer was 500000000

Now I have a question: what is going on here?

More specifically ... How can a fixed function take such different amounts of time? Is there something wrong in my code? Is there something tricky about the Intel processor? Are the compilers doing something odd? Can it be because 32-bit code is run on a 64-bit processor?

Thanks for your attention!

Edit: I accept that g++ -O1 simply reuses the returned value for the two other calls. I also accept that g++ -O2 and g++ -O3 have a defect that leaves that optimization out. The significant diversity of measured speeds (450%!) still seems mysterious.

I looked at the disassembly of the code produced by VS2010. It inlined unpredictableIfs 3 times. The inlined code was fairly similar; the loop was the same. It did not inline noIfs, but it did unroll noIfs a bit, so it does 4 steps per iteration. noIfs computes the sum as written, while unpredictableIfs uses jne to jump over the increment.

Öö Tiib
  • Have you actually checked that your "if" and "noIfs" are actually different once compiled? Many compilers these days can figure out that it's better to do some math than conditional jumps for the type of things you are doing... – Mats Petersson Feb 19 '13 at 19:20
  • @MatsPetersson The generated code with VS2010 was fairly different. I will edit post to explain. – Öö Tiib Feb 19 '13 at 19:31
  • @ÖöTiib be sure to show us the command line switches you used when compiling with VS2010. – Nik Bougalis Feb 19 '13 at 19:31
  • It's very hard to write good speed tests. Compilers are much smarter than you could ever imagine. – Pete Becker Feb 19 '13 at 23:52
  • @PeteBecker that is why I asked. If it is hard then it is worth learning. ;) – Öö Tiib Feb 19 '13 at 23:57

3 Answers

13

With -O1, gcc-4.7.1 calls unpredictableIfs only once and reuses the result, since it recognizes that it's a pure function, so the result will be the same every time it's called. (Mine did, verified by looking at the generated assembly.)

With higher optimisation levels, the functions are inlined, and the compiler no longer recognizes that it's the same code, so it is run each time a function call appears in the source.
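
If you want every timed call to actually execute the loop, one option (building on the __attribute__ suggestions in the comments below) is to mark the function noinline and give it an argument the compiler cannot see through. A minimal sketch, assuming GCC (the attribute syntax is GCC-specific; the names mirror the question's code):

#include <stdio.h>

/* noinline keeps the body out of main(), so every call site really calls it;
   reading the loop bound from a volatile stops the compiler from proving that
   all calls must return the same value, even if it treats the function as pure. */
__attribute__((noinline))
static int unpredictableIfs( int limit )
{
    int sum = 0;
    for ( int i = 0; i < limit; ++i )
    {
        if ( ( i & 0x40000002 ) == 0 )
        {
            ++sum;
        }
    }
    return sum;
}

int main( void )
{
    int volatile billion = 1000000000;  /* volatile load defeats constant folding */
    printf( "%d\n", unpredictableIfs( billion ) );
    printf( "%d\n", unpredictableIfs( billion ) );
    return 0;
}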

Apart from that, my gcc-4.7.1 deals best with unpredictableIfs when using -O1 or -O2 (apart from the reuse issue, both produce the same code), while noIfs is treated much better with -O3. The timings between the different runs of the same code are however consistent here - equal or differing by 10 milliseconds (granularity of clock), so I have no idea what could cause the substantially different times for unpredictableIfs you reported for -O3.

With -O2, the loop for unpredictableIfs is identical to the code generated with -O1 (except for register swapping):

.L12:
    movl    %eax, %ecx
    andl    $1073741826, %ecx
    cmpl    $1, %ecx
    adcl    $0, %edx
    addl    $1, %eax
    cmpl    $1000000000, %eax
    jne .L12

and for noIfs it's similar:

.L15:
    xorl    %ecx, %ecx
    testl   $1073741826, %eax
    sete    %cl
    addl    $1, %eax
    addl    %ecx, %edx
    cmpl    $1000000000, %eax
    jne .L15

where it was

.L7:
    testl   $1073741826, %edx
    sete    %cl
    movzbl  %cl, %ecx
    addl    %ecx, %eax
    addl    $1, %edx
    cmpl    $1000000000, %edx
    jne .L7

with -O1. Both loops run in similar time, with unpredictableIfs a bit faster.

With -O3, the loop for unpredictableIfs becomes worse,

.L14:
    leal    1(%rdx), %ecx
    testl   $1073741826, %eax
    cmove   %ecx, %edx
    addl    $1, %eax
    cmpl    $1000000000, %eax
    jne     .L14

and for noIfs (including the setup-code here), it becomes better:

    pxor    %xmm2, %xmm2
    movq    %rax, 32(%rsp)
    movdqa  .LC3(%rip), %xmm6
    xorl    %eax, %eax
    movdqa  .LC2(%rip), %xmm1
    movdqa  %xmm2, %xmm3
    movdqa  .LC4(%rip), %xmm5
    movdqa  .LC5(%rip), %xmm4
    .p2align 4,,10
    .p2align 3
.L18:
    movdqa  %xmm1, %xmm0
    addl    $1, %eax
    paddd   %xmm6, %xmm1
    cmpl    $250000000, %eax
    pand    %xmm5, %xmm0
    pcmpeqd %xmm3, %xmm0
    pand    %xmm4, %xmm0
    paddd   %xmm0, %xmm2
    jne .L18

.LC2:
    .long   0
    .long   1
    .long   2
    .long   3
    .align 16
.LC3:
    .long   4
    .long   4
    .long   4
    .long   4
    .align 16
.LC4:
    .long   1073741826
    .long   1073741826
    .long   1073741826
    .long   1073741826
    .align 16
.LC5:
    .long   1
    .long   1
    .long   1
    .long   1

it computes four iterations at once, and accordingly noIfs runs almost four times as fast.
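
For readers who do not want to parse the SSE2, here is a rough scalar C sketch of what the vectorized loop does (an illustration of the transformation only, not the compiler's actual output):

#include <stdio.h>

/* Rough scalar equivalent of the -O3 noIfs loop: four consecutive values of i
   are tested per pass, so the loop body executes 250,000,000 times (matching
   the cmpl $250000000 above) and the four partial sums are combined at the end. */
static int noIfs_by4( void )
{
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for ( int i = 0; i < 1000000000; i += 4 )
    {
        s0 += ( ( i + 0 ) & 0x40000002 ) == 0;
        s1 += ( ( i + 1 ) & 0x40000002 ) == 0;
        s2 += ( ( i + 2 ) & 0x40000002 ) == 0;
        s3 += ( ( i + 3 ) & 0x40000002 ) == 0;
    }
    return s0 + s1 + s2 + s3;
}

int main( void )
{
    printf( "%d\n", noIfs_by4() );  /* prints 500000000 */
    return 0;
}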

Daniel Fischer
  • Verified with 4.7.2. The optimization on -O1 also doesn't happen when a side effect, such as a printf, is inserted into the function. – thiton Feb 19 '13 at 19:27
  • What? Somehow, I feel this should be a GCC bug. – Jonas Schäfer Feb 19 '13 at 19:28
  • @JonasWielicki Why? The result of `unpredictableIfs` is always the same and can be calculated at compile time. Or are you puzzled as to why it's not reused with more aggressive optimizations? – Nik Bougalis Feb 19 '13 at 19:29
  • @JonasWielicki Why? Because it's not clever enough to reuse the result with `-O2` or `-O3`, or because it reuses the result with `-O1`? – Daniel Fischer Feb 19 '13 at 19:29
  • @NikBougalis Of course not because of the inlining! I consider it a bug that the code becomes actually worse with ``-O2`` and upwards. – Jonas Schäfer Feb 19 '13 at 19:31
  • Yes, I agree, it's a bug that it doesn't realize that the result is the same for calling the function twice. – Mats Petersson Feb 19 '13 at 19:32
  • @MatsPetersson Even if I tell it explicitly with ``__attribute__((pure))``, it will still prefer inlining at ``-O3`` and ``-O2``. Putting ``__attribute__((noinline))`` helps, though ;) – Jonas Schäfer Feb 19 '13 at 19:33
  • Your answer explains the reason for the -O1 results; that deserves +1. Great ... now what about the g++ -O3 and VS2010 /O2 results? Why are those so different during the same run? – Öö Tiib Feb 19 '13 at 19:35
  • I don't know. The compiler may simply be optimizing the code more aggressively and discarding the return value (which can be recalculated) to avoid, for example, register pressure. It's hard to tell and I would *definitely* not jump to the conclusion that it's a bug. Frankly, I am surprised that the compiler doesn't calculate the result of `unpredictableIfs` at compile time. – Nik Bougalis Feb 19 '13 at 19:36
  • @ÖöTiib I don't get such different times for `unpredictableIfs` with `-O3`, everything is within normal fluctuation there - although slower than with `-O2` or `-O1`. However, I get a **much** faster `noIfs` with `-O3` (still trying to understand what it did there). As for VS2010, I can't say anything since I don't have that installed. – Daniel Fischer Feb 19 '13 at 19:43
  • I have just run it on my Linux machine, and the results are very similar to the ones posted above [except my machine appears to be about twice as fast for the first few iterations] – Mats Petersson Feb 19 '13 at 19:47
  • @DanielFischer Agreed on ``-O3``, something must be different in there, getting a factor of x4. – Jonas Schäfer Feb 19 '13 at 19:51
  • @DanielFischer Can you tell me the processor/OS? Are you compiling as 64-bit or 32-bit? – Öö Tiib Feb 19 '13 at 20:02
  • @ÖöTiib It's a Core i5-2410M, I'm running 64-bit Linux. I don't think Linux/Windows would make a difference in principle for the code gcc generates, though 64-bit vs. 32 surely does (for the `-O3` version of `noIfs` at least). – Daniel Fischer Feb 19 '13 at 20:15
  • The problem with -O3 is that it generates a `conditional move (cmove)` which on x86 is usually a Very Bad Thing. See http://stackoverflow.com/questions/14805641/why-does-changing-const-ull-to-const-ull-in-function-parameter-result-in-pe/14819939#14819939. There is an open gcc problem report link in that SO answer - gcc developers have many reports of performance problems on x86 with -O3 generating conditional moves and are looking for how to fix it. – amdn Feb 19 '13 at 20:52
  • @amdn Interesting, thanks. Any idea _why_ a `cmove` is so bad on x86? – Daniel Fischer Feb 19 '13 at 21:04
  • @DanielFischer `cmove` is apparently only worth it when a conditional branch within a loop is badly mis-predicted. If the prediction is better than 92% it is better to use compare and branch. My guess is the micro-architecture of modern x86/AMD processors doesn't issue any speculative instructions until the `conditional move` is resolved... which means the branch at the bottom of the loop (`jne .L14` above) stalls -- you basically end up with a bubble in the pipeline (a stall) on every iteration of the loop - the gcc bug report link: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309 – amdn Feb 19 '13 at 21:33
  • @amdn If it was a complete stall, I'd expect a worse impact here, since the branch is trivially predictable and the loop body is small and cheap. But that's just gut-feeling. Anyway, thanks. I didn't know about the `cmove` issue before. Pondering if I should add `-fno-tree-loop-if-convert` to my gcc alias. – Daniel Fischer Feb 19 '13 at 21:42
  • Another possibility is that branch prediction uses a pattern history table and if the loop is very tight then both the `cmove` and the branch that closes the loop fall in the same slot and interfere with each other... so you end up with two stalls, one due to `cmove` misprediction and one due to the loop branch misprediction http://faculty.cse.tamu.edu/djimenez/pdfs/pldi05_dist.pdf – amdn Feb 19 '13 at 21:45
  • Does running a 32-bit app on 64-bit Linux mean installing a lot of worthless crap? – Öö Tiib Feb 20 '13 at 20:14
  • @ÖöTiib Good question. I don't know what I'd need to install, at the moment, I can't compile 32-bit binaries. I have a 32-bit box, though, and I can compile and run it there to see (but that has an older gcc, I think 4.6). Going to take a few minutes. – Daniel Fischer Feb 20 '13 at 20:31
  • @ÖöTiib On my 32-bit box (gcc-4.6.2), things are slower, but a) with `-O1`, `unpredictableIfs` is also called only once, inlined, and run thrice with `-O2` and `-O3`; b) the code for both, `unpredictableIfs` and `noIfs` is identical to that from the 64-bit box with `-O1` and `-O2`; c) with `-O3`, the code for `unpredictableIfs` becomes a bit faster (no `cmove`, but a branch and jump over a move), while the code for `noIfs` is the same as for `-O2`. The timings fluctuate a bit more, and in some runs I get a ~35% difference between the last `unpredictableIfs` run and the others. – Daniel Fischer Feb 20 '13 at 21:05
  • Usually, the times are within normal fluctuation, though. – Daniel Fischer Feb 20 '13 at 21:07
  • I'm starting to suspect that it is really something about how Intel emulates a 32-bit processor on a 64-bit processor. Sometimes it is in some favorable mode and sometimes it is not. It can't be anything else; the body of the loop is identical. The very good performance of noIfs on 64-bit likely comes from the total lack of branches to mispredict. – Öö Tiib Feb 20 '13 at 23:11
  • There should be no mispredictions after the first few iterations anyway, the pattern is trivial to predict for all but the most naive branch predictors. The good performance for `noIfs` with `-O3` on 64 bits is because it handles four numbers at once, so has only a quarter of the iterations. – Daniel Fischer Feb 20 '13 at 23:22
  • True, true, false, false ... seems to be the worst pattern to predict. You can easily see it by editing the pattern constant. That was the point being discussed before I ran into this discovery. – Öö Tiib Feb 23 '13 at 02:51
2

Right, looking at the assembler code from gcc on 64-bit Linux: in the first case, with -O1, the function UnpredictableIfs is indeed called only once, and the result is reused.

With -O2 and -O3, the functions are inlined, and the time they take should be identical. There are also no actual branches in either bit of code, but the translation of the two bits of code is somewhat different. I've cut the listings down to the loop bodies that update "sum" [held in %edx in both cases]:

UnpredictableIfs:

movl    %eax, %ecx
andl    $1073741826, %ecx
cmpl    $1, %ecx
adcl    $0, %edx
addl    $1, %eax

NoIfs:

xorl    %ecx, %ecx
testl   $1073741826, %eax
sete    %cl
addl    $1, %eax
addl    %ecx, %edx

As you can see, it's not quite identical, but it does very similar things.
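
In C terms, my reading is that both sequences perform the same branch-free update: cmpl $1/adcl $0 adds the carry flag, which is 1 exactly when the AND result was zero, while testl/sete materializes that boolean explicitly. A minimal sketch of what either loop body boils down to:

/* Add 1 to sum when (i & mask) == 0, with no conditional branch involved. */
int loop_body( int sum, int i )
{
    return sum + ( ( i & 0x40000002 ) == 0 );
}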

Mats Petersson
2

Regarding the range of results on Windows (from 1016 ms to 4797 ms): You should know that clock() in MSVC returns elapsed wall time. The standard says clock() should return an approximation of CPU time spent by the process, and other implementations do a better job of this.

Given that MSVC is giving wall time, if your process got pre-empted while running one iteration of the test, it could give a much larger result, even if the code ran in approximately the same amount of CPU time.

Also note that clock() on many Windows PCs has a pretty lousy resolution, often like 11-19 ms. You've done enough iterations that that's only about 1%, so I don't think it's part of the discrepancy, but it's good to be aware of when trying to write a benchmark. I understand you're going for portability, but if you needed a better measurement on Windows, you can use QueryPerformanceCounter which will almost certainly give you much better resolution, though it's still just elapsed wall time.
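
For reference, a minimal QueryPerformanceCounter timing sketch (Windows-only; error handling omitted and the code under test is just a placeholder):

#include <stdio.h>
#include <windows.h>

/* Elapsed wall time in milliseconds between two QueryPerformanceCounter readings. */
static double qpc_ms( LARGE_INTEGER start, LARGE_INTEGER stop )
{
    LARGE_INTEGER freq;
    QueryPerformanceFrequency( &freq );  /* counter ticks per second */
    return 1000.0 * ( stop.QuadPart - start.QuadPart ) / freq.QuadPart;
}

int main( void )
{
    LARGE_INTEGER start, stop;
    QueryPerformanceCounter( &start );
    /* ... code under test, e.g. unpredictableIfs() ... */
    QueryPerformanceCounter( &stop );
    printf( "Took %.3f msec\n", qpc_ms( start, stop ) );
    return 0;
}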

UPDATE: After I learned that the long runtime on the one case was happening consistently, I fired up VS2010 and reproduced the results. I was typically getting something around 1000 ms for some runs, 750 ms for others, and 5000+ ms for the inexplicable ones.

Observations:

  1. In all cases the unpredictableIfs() code was inlined.
  2. Removing the noIfs() code had no impact (so the long time wasn't a side effect of that code).
  3. Setting thread affinity to a single processor had no effect.
  4. The 5000 ms times were invariably the later instances. I noted that the later instances had an extra instruction before the beginning of the loop: lea ecx,[ecx]. I don't see why that should make a 5x difference. Other than that, the early and later instances were identical code.
  5. Removing the volatile from the start and stop variables yielded fewer long runs, more of the 750 ms runs, and no 1000 ms runs. (The generated loop code looks exactly the same in all cases now; no leas.)
  6. Removing the volatile from the sum variable (but keeping it for the clock timers), the long runs can happen in any position.
  7. If you remove all of the volatile qualifiers, you get consistent, fast (750 ms) runs. (The code looks identical to the earlier ones, but edi was chosen for sum instead of ecx.)

I'm not sure what to conclude from all this, except that volatile has unpredictable performance consequences with MSVC, so you should apply it only when necessary.

UPDATE 2: I'm seeing consistent runtime differences tied to the use of volatile, even though the disassembly is almost identical.

With volatile:

Puzzling measurements:
Unpredictable ifs took 643 msec; answer was 500000000
Unpredictable ifs took 1248 msec; answer was 500000000
Unpredictable ifs took 605 msec; answer was 500000000
Unpredictable ifs took 4611 msec; answer was 500000000
Unpredictable ifs took 4706 msec; answer was 500000000
Unpredictable ifs took 4516 msec; answer was 500000000
Unpredictable ifs took 4382 msec; answer was 500000000

The disassembly for each instance looks like this:

    start = clock();
010D1015  mov         esi,dword ptr [__imp__clock (10D20A0h)]  
010D101B  add         esp,4  
010D101E  call        esi  
010D1020  mov         dword ptr [start],eax  
    sum = unpredictableIfs();
010D1023  xor         ecx,ecx  
010D1025  xor         eax,eax  
010D1027  test        eax,40000002h  
010D102C  jne         main+2Fh (10D102Fh)  
010D102E  inc         ecx  
010D102F  inc         eax  
010D1030  cmp         eax,3B9ACA00h  
010D1035  jl          main+27h (10D1027h)  
010D1037  mov         dword ptr [sum],ecx  
    stop = clock();
010D103A  call        esi  
010D103C  mov         dword ptr [stop],eax  

Without volatile:

Puzzling measurements:
Unpredictable ifs took 644 msec; answer was 500000000
Unpredictable ifs took 624 msec; answer was 500000000
Unpredictable ifs took 624 msec; answer was 500000000
Unpredictable ifs took 605 msec; answer was 500000000
Unpredictable ifs took 599 msec; answer was 500000000
Unpredictable ifs took 599 msec; answer was 500000000
Unpredictable ifs took 599 msec; answer was 500000000

    start = clock();
00321014  mov         esi,dword ptr [__imp__clock (3220A0h)]  
0032101A  add         esp,4  
0032101D  call        esi  
0032101F  mov         dword ptr [start],eax  
    sum = unpredictableIfs();
00321022  xor         ebx,ebx  
00321024  xor         eax,eax  
00321026  test        eax,40000002h  
0032102B  jne         main+2Eh (32102Eh)  
0032102D  inc         ebx  
0032102E  inc         eax  
0032102F  cmp         eax,3B9ACA00h  
00321034  jl          main+26h (321026h)  
    stop = clock();
00321036  call        esi
// The only optimization I see is here, where eax isn't explicitly stored
// in stop but is instead immediately used to compute the value for the
// printf that follows.

Other than register selection, I don't see a significant difference.

Adrian McCarthy
  • This is generally a problem that renders tiny measurements useless. When measuring things that last a second it can also cause 1-2% fluctuations between tests. It cannot explain a 472% difference. The difference is repeatable: running the tests again, I often get exactly the same numbers. – Öö Tiib Feb 20 '13 at 15:52
  • I didn't realize the long run was happening consistently. – Adrian McCarthy Feb 20 '13 at 17:26
  • Removing volatile was a bad idea. volatile forces the results of the function calls to be assigned to the variables in exactly the same order as in the code. The compiler can prove that the functions clock() and noIfs() are unrelated, but with volatile it still cannot optimize the timing away. There are no other effects of volatile. The tests that were measured contained nothing volatile, so there was no way it could affect their performance. – Öö Tiib Feb 20 '13 at 20:02
  • I was with you until you said that volatile could not affect the performance, since it does, in fact, affect the performance in a consistent and dramatic way. – Adrian McCarthy Feb 21 '13 at 00:24
  • No, it does not. Something else affects the performance. The lack of volatile just made it sometimes place the calls to clock() in the wrong places; I verified that in the disassembler. I had several more tests, and just rearranging them caused unpredictableIfs() to run sometimes 1300 ms, sometimes 1000 ms and sometimes 4800 ms, volatile or not; even a human feels the difference. – Öö Tiib Feb 21 '13 at 00:31
  • I've updated my answer with more data, because it seems I'm seeing something slightly different from what you're describing. – Adrian McCarthy Feb 21 '13 at 04:34
  • Yes, you have erased even noIfs(). When I tested, I had two more similar functions. Arranging them in different patterns caused different results. unpredictableIfs() was the one that changed the most; the others were more stable. – Öö Tiib Feb 21 '13 at 04:41
  • I erased noIfs after several experiments showed that it made no difference. Also, I just tried again using 64-bit builds. With 64-bit the volatile is necessary to avoid optimization that messes up the timing, and the resulting times are *fairly* consistent around 1100 ms. – Adrian McCarthy Feb 21 '13 at 04:46
  • You ended where I started. All were uniform. Then I noticed odd differences, then rearranged and got even more diverse differences, then noticed that the lack of volatile sometimes causes wrong results, added volatile and arranged things so that the differences manifested themselves the most. Then I posted. – Öö Tiib Feb 21 '13 at 04:53
  • Just erasing the first printf in the 32-bit build may turn everything upside down again. – Öö Tiib Feb 21 '13 at 05:02
  • I was curious about volatile and the code-reordering aspect, so I asked about it here: http://stackoverflow.com/questions/15041519/benchmarking-code-reordering-volatile – Adrian McCarthy Mar 02 '13 at 15:59