65

While trying to measure how long a single line of C code takes to execute, I noticed this weird thing:

int main (char argc, char * argv[]) {
    time_t begin, end;
    uint64_t i;
    double total_time, free_time;
    int A = 1;
    int B = 1;

    begin = clock();
    for (i = 0; i<(1<<31)-1; i++);
    end = clock();
    free_time = (double)(end-begin)/CLOCKS_PER_SEC;
    printf("%f\n", free_time);

    begin = clock();
    for (i = 0; i<(1<<31)-1; i++) {
        A += B%2;
    }
    end = clock();
    free_time = (double)(end-begin)/CLOCKS_PER_SEC;
    printf("%f\n", free_time);

    return(0);
}

This, when executed, displays:

5.873425
4.826874

Why does the empty loop use more time than the second one, which has an instruction inside? Of course I've tried many variants, but every time, the empty loop takes more time than the one with a single instruction inside.

Note that I have tried swapping the order of the loops and adding some warmup code, and it didn't change the problem at all.

I'm using Code::Blocks as my IDE with the GNU GCC compiler on Ubuntu 14.04, and I have a quad-core Intel i5 at 2.3 GHz (I've tried running the program on a single core; this doesn't change the result).

jfs
Celerio
    did you do this more than once, and were you running anything else? also what compiler options are you using? – Kevin L Jul 31 '14 at 20:08
  • 4
    What are your compiler options and what does the assembly have to say about this? – chris Jul 31 '14 at 20:08
  • 7
    How many C compilers do you think spend a lot of effort on optimizing empty loops? – Hot Licks Jul 31 '14 at 20:09
  • 15
    If you're interested in the performance of your code, ditch `clock()` and turn a genuine *profiler* loose on release-mode optimized bits. – WhozCraig Jul 31 '14 at 20:12
  • @HotLicks: If the code has no observable effects, I'd expect any compiler to just remove it. Keep in mind that an empty loop might not look empty in the source code; the loop body might have been stripped out in the preprocessing stage, e.g. if it's just some debug prints. – Thomas Jul 31 '14 at 20:13
  • 5
    Since the value of `A` is never used, it's likely the compiler made the second loop empty as well, in which case the question is why aren't the two loops identical. What happens to the timings if you print the value of `A` after the second loop? – chepner Jul 31 '14 at 20:14
  • 9
    It's an unrealistic, "toy" case, poorly measured. Nothing of any significance can be derived. – Hot Licks Jul 31 '14 at 20:15
  • As the number of iterations is fixed, there is probably some loop unrolling going on in the second case, which may not happen in the first one because the compiler didn't get the point of the loop. You may try using a variable for the count to see if that changes something. Another option to measure how long the operation takes would be to add another instruction in both loops. – jcaron Jul 31 '14 at 20:18
  • 2
Btw `1 << 31` is an `int` expression. Your shift results in an overflow on platforms where `int` is 32-bit. You're shifting into the sign bit, which is UB. Change that to `(uint64_t)1 << 31`. – WhozCraig Jul 31 '14 at 20:19
  • 2
    Your title is very misleading. One instruction is very different than `A += B%2;`, which probably requires several instructions. – Fiddling Bits Jul 31 '14 at 20:34
  • @chepner: Printing `A` isn't enough to prevent optimization of the loop. `B` is not modified inside the loop, therefore `B%2` can be hoisted outside (and even resolved to a constant `1`). The compiler can also calculate the number of iterations, and replace iterated addition with multiplication (and even resolve to a constant). – Ben Voigt Jul 31 '14 at 20:42
  • @chepner: To elaborate slightly on Ben Voigt's point here, let's look at some assembly output again: https://gist.github.com/sharth/35c575127bd75862cfbc – Bill Lynch Jul 31 '14 at 20:46
  • Yeah, I underestimated how many optimizations might be available. – chepner Jul 31 '14 at 20:48
  • A similar effect is described here: http://eli.thegreenplace.net/2013/12/03/intel-i7-loop-performance-anomaly – jmiserez Aug 01 '14 at 01:41
  • 1
    What happens if you switch the positions of the loops? I.e., perform the second loop first, then perform the first loop second. Often I find taking simpler approaches can lead to broader insights. – Bryan W. Wagner Aug 01 '14 at 02:34
@BryanW.Wagner I had tried switching positions, and the results didn't change at all; the short loop remained slower and the big one faster, with approximately the same timings as before. – Celerio Aug 01 '14 at 10:12
  • 2
@FiddlingBits: You're right, I edited this. – Celerio Aug 01 '14 at 10:22
  • Aside: the return value of `clock` is of type `clock_t`, not `time_t`. –  Aug 03 '14 at 10:52

4 Answers

78

Assuming that your code uses a 32-bit integer `int` type (which your system probably does), nothing can be determined from your code; instead, it exhibits undefined behavior. The compiler reports:

foo.c:5:5: error: first parameter of 'main' (argument count) must be of type 'int'
int main (char argc, char * argv[]) {
    ^
foo.c:13:26: warning: overflow in expression; result is 2147483647 with type 'int' [-Winteger-overflow]
    for (i = 0; i<(1<<31)-1; i++);
                         ^
foo.c:19:26: warning: overflow in expression; result is 2147483647 with type 'int' [-Winteger-overflow]
    for (i = 0; i<(1<<31)-1; i++) {
                         ^

Let's try to fix that:

#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <limits.h>

int main (int argc, char * argv[]) {
    time_t begin, end;
    uint64_t i;
    double total_time, free_time;
    int A = 1;
    int B = 1;

    begin = clock();
    for (i = 0; i<INT_MAX; i++);
    end = clock();
    free_time = (double)(end-begin)/CLOCKS_PER_SEC;
    printf("%f\n", free_time);

    begin = clock();
    for (i = 0; i<INT_MAX; i++) {
        A += B%2;
    }
    end = clock();
    free_time = (double)(end-begin)/CLOCKS_PER_SEC;
    printf("%f\n", free_time);

    return(0);
}

Now, let's look at the compiler output for this code. Personally, I find LLVM's intermediate representation very readable, so I'm going to show that. I'll produce it by running:

clang -O3 foo.c -S -emit-llvm -std=gnu99

Here's the relevant portion of the output (the main function):

define i32 @main(i32 %argc, i8** nocapture readnone %argv) #0 {
  %1 = tail call i64 @"\01_clock"() #3
  %2 = tail call i64 @"\01_clock"() #3
  %3 = sub nsw i64 %2, %1
  %4 = sitofp i64 %3 to double
  %5 = fdiv double %4, 1.000000e+06
  %6 = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([4 x i8]* @.str, i64 0, i64 0), double %5) #3
  %7 = tail call i64 @"\01_clock"() #3
  %8 = tail call i64 @"\01_clock"() #3
  %9 = sub nsw i64 %8, %7
  %10 = sitofp i64 %9 to double
  %11 = fdiv double %10, 1.000000e+06
  %12 = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([4 x i8]* @.str, i64 0, i64 0), double %11) #3
  ret i32 0
}

Note that there are no operations between the calls to clock() in either case: both loops were optimized away entirely, so they compile to exactly the same thing.

Bill Lynch
  • 15
    I have no idea who was clueless enough to downvote you for pointing out undefined behavior; nevertheless, this answer could be improved by showing the code and compiler command line which generated that, instead of ambiguous "If we fix these issues" – Ben Voigt Jul 31 '14 at 20:22
  • 1
    @BenVoigt: Yeah. I was trying to keep the answer not too long. – Bill Lynch Jul 31 '14 at 20:23
  • 2
    At least tell what you did to "fix" the UB. Did you use `INT_MAX` ? `0x7FFFFFFF` ? And the compile options. – Ben Voigt Jul 31 '14 at 20:24
  • @BenVoigt: I've made some adjustments. – Bill Lynch Jul 31 '14 at 20:25
  • 1
    Too bad I can't give a second upvote. (BTW, showing the new version of the for loop would have been enough to explain the "fix", if you wanted to keep it short.) – Ben Voigt Jul 31 '14 at 20:26
  • Same goes for the GCC asm output I'm afraid. https://gist.github.com/sevki/973e91d83e3feacf44ed – Sevki Jul 31 '14 at 20:42
  • Taking this code, compiling *without* optimization in VS2013, I get the reported behavior. If I switch the order of the loop tests I still get the empty loop slower. Running on: Win7, Intel Xeon E3-1240. – Keith Aug 01 '14 at 02:59
@sharth Sorry, my question is probably stupid, but I don't understand anything in assembly: if there are no operations between the calls to clock(), why do I get different times? Oh, and sorry for not fixing my code before posting, that was stupid. Thanks for correcting. – Celerio Aug 01 '14 at 10:40
  • @Celerio: Presumably, you were compiling your code without optimizations enabled. Or, you could have been using a different compiler that did not employ all of the optimizations that clang used to acquire this answer. – Bill Lynch Aug 01 '14 at 14:38
  • @Celerio Alternatively, some of the other posts talk about how your processor itself could be responsible for this behavior. My answer is strictly to point out what a compiler should be producing for this code, which is zero instructions. – Bill Lynch Aug 01 '14 at 14:50
  • wanted to upvote, then noticed you were at 42 upvotes :P – Ven Aug 01 '14 at 14:58
  • 6
    -1: I still don't know why the empty loop is slower. Also, you modified the code, used `-O3` and a different compiler. And as it turns out from this [other answer](http://stackoverflow.com/a/25069078/202504), the undefined behaviour is *not* responsible for the performance difference. – jmiserez Aug 01 '14 at 22:11
  • 2
    @jmiserez: Of course he used `-O3`. The question didn't say otherwise, and enabling optimization is the most sensible thing to do when measuring performance. (We can debate whether `-O3` or `-O2` or `-Os` is best, but the "right" level to explain what the OP observed is what the OP used, and the question didn't reveal that) – Ben Voigt Aug 04 '14 at 16:19
  • 2
    @jmiserez: Of your points, the only one I can take seriously is that I used a different compiler. So I just did a test with gcc 4.8.2 and gcc 4.9.0, and other than it reporting __another__ piece of undefined behavior that I missed, it also optimizes the loop into a nop. – Bill Lynch Aug 04 '14 at 17:16
45

The fact is that modern processors are complicated. All the instructions executed interact with each other in complicated and interesting ways. Thanks to "that other guy" for posting the code.

Both the OP and "that other guy" apparently found that the short loop takes 11 cycles, while the long one takes 9 cycles. For the long loop, 9 cycles is plenty of time, even though there are a lot of operations. For the short loop, there must be some stall caused by it being so short, and just adding a nop makes the loop long enough to avoid the stall.

One thing stands out if we look at the code:

0x00000000004005af <+50>:    addq   $0x1,-0x20(%rbp)
0x00000000004005b4 <+55>:    cmpq   $0x7fffffff,-0x20(%rbp)
0x00000000004005bc <+63>:    jb     0x4005af <main+50>

We read i and write it back (addq). We read it again immediately, and compare it (cmpq). And then we loop. But the loop uses branch prediction. So at the time the addq is executed, the processor isn't really sure it is allowed to write to i (because the branch prediction could be wrong).

Then we compare against i. The processor will try to avoid reading i from memory, because reading it takes a long time. Instead, some bit of hardware will remember that we just wrote to i by adding to it, and instead of reading i, the cmpq instruction gets the data straight from the store instruction (this is called store-to-load forwarding). Unfortunately, we are not sure at this point whether the write to i actually happened or not! So that could introduce a stall here.

The problem here is that the conditional jump, the addq which leads to a conditional store, and the cmpq which isn't sure where to get its data, are all very, very close together. They are unusually close together. It could be that they are so close together that the processor cannot figure out at this point whether to take i from the store instruction or to read it from memory. So it reads it from memory, which is slower because it has to wait for the store to finish. And adding just one nop gives the processor enough time.

Usually you think that there is RAM and there is cache. On a modern Intel processor, a read can be served from (slowest to fastest):

  1. Memory (RAM)
  2. L3 cache (optional)
  3. L2 cache
  4. L1 cache
  5. Previous store instruction that hasn't written to L1 cache yet.

So what the processor does internally in the short, slow loop:

  1. Read i from L1 cache
  2. Add 1 to i
  3. Write i to L1 cache
  4. Wait until i is written to L1 cache
  5. Read i from L1 cache
  6. Compare i with INT_MAX
  7. Branch to (1) if it is less.

In the long, fast loop, the processor does:

  1. Lots of stuff
  2. Read i from L1 cache
  3. Add 1 to i
  4. Do a "store" instruction that will write i to L1 cache
  5. Read i directly from the "store" instruction without touching L1 cache
  6. Compare i with INT_MAX
  7. Branch to (1) if it is less.
gnasher729
Thanks for this very complete answer. As I've never written any assembly code, this level of detail really helps. You should put links from your answer to the OP's and "that other guy"'s answers :) – Celerio Aug 01 '14 at 10:27
  • 1
The answers below seem to invalidate this one. Also, the main point of this answer is only guessing ("It could be that"). – BartoszKP Aug 02 '14 at 11:42
  • 1
    Since "above" and "below" is quite relative, please explain how this answer is invalidated. – gnasher729 Aug 02 '14 at 15:35
  • 1
    http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf explains various types of stalls that you can incur and how to avoid them. Basically, the NOP prevents a resource stall by giving the processor time to commit the changes to the register. The cache isn't necessarily relevant here, but the stalls are real; writing to a register is still writing to some type of memory that still has a finite write speed, and will cause stalls when the CPU notices that an instruction needs a register already in use. – sfdcfox Aug 02 '14 at 18:17
  • Is the optimizer smart enough to get good scheduling? The conclusion the OP derived from this effect is unwarranted unless the optimizer generates code that suffers from it. – Ben Voigt Aug 04 '14 at 16:20
30

This answer assumes that you've already understood and addressed the excellent points regarding undefined behavior that sharth makes in his answer. He also points out tricks the compiler may play on your code. You should take steps to make sure the compiler doesn't recognize the entire loop as useless. For example, changing the iterator declaration to `volatile uint64_t i;` will prevent removal of the loop, and `volatile int A;` will ensure that the second loop actually does more work than the first. But even if you do all of that, you may still discover that:

Code later in a program may well execute more quickly than earlier code.

The clock() library function could have caused an icache miss after reading the timer and before returning. That would add some extra time to the first measured interval. (For later calls, the code is already in cache.) However, this effect will be tiny, certainly too small for clock() to measure, even if it were a page fault all the way to disk. Random context switches can add to either time interval.

More importantly, you have an i5 CPU, which has dynamic clocking. When your program begins execution, the clock rate is most likely low, because the CPU has been idle. Just running the program makes the CPU no longer idle, so after a short delay the clock speed will increase. The ratio between the idle and TurboBoosted CPU clock frequencies can be significant. (On my ultrabook's Haswell i5-4200U, the former multiplier is 8 and the latter is 26, making startup code run less than 30% as fast as later code! "Calibrated" loops for implementing delays are a terrible idea on modern computers!)

Including a warmup phase (running a benchmark repeatedly and throwing the first result away) for more precise timing is not only for managed frameworks with JIT compilers!

Ben Voigt
  • 2
    I had already tried adding some warmup phase, or even switching the positions of the loops without success. I should have put this in my answer, I'll edit. – Celerio Aug 01 '14 at 10:33
  • Actually, I don't believe it's fair to give warmup time to managed frameworks and JIT compilers ;) unless they never see a reboot and ramp up very quickly. Like it's not fair to give them free VM loading time, etc. because those may very well be non-negligible delays. Daily reboot, very slow uncached behavior, slow caching, so many things can make the "warmed up" state unrealistic. – Morg. Aug 02 '14 at 10:11
  • 1
    @Morg.: This isn't about being "fair" to differences between frameworks, this is about testing which variation on code for a particular language/framework gets the best results. – Ben Voigt Aug 02 '14 at 15:28
  • @BenVoigt My bad. On that topic then: which variation on code for a particular language/framework gets the best results also depends on whether that variation needs no warmup, realistic warmup or unrealistic warmup. It's very easy to cache everything and then pretend one variation is faster than all others. Good thing this is about C though, I just thought your sentence in bold had implications that would make a benchmark less relevant rather than more. – Morg. Aug 04 '14 at 16:02
  • @Morg.: Warmup is the easiest way to put all variants on a level playing field. If warm execution isn't the comparison you're interested in, then my all means, do measure startup time and cache misses, but make sure you incur them for all versions. You may start the executable from scratch, but that doesn't mean the data files aren't in disk cache, etc. It really is quite complicated to measure "cold" run time. – Ben Voigt Aug 04 '14 at 16:14
  • @BenVoigt Warmup is the easiest way to disregard any effects resulting from real world conditions rather than "best case" optimistic lab conditions. The kind of benchmarking you suggest is mostly a way to figure out whether one ASM output is more efficient than another in the context of short pieces of code that run very often and usually have no side-effects. It's an important part of benchmarking but it's not applicable to more complex programs or full applications, for which side-effects and other environmental characteristics are a lot more important than raw lab speed. – Morg. Aug 05 '14 at 07:24
  • @Morg: Warmup doesn't prevent cache misses and other real world effects from slowing code down, what it does is focus on the cache misses caused by the code itself. And the code snippets being compared definitely don't have to be short. But yes, any synthetic testing has limitations. To get "real" results, you need to average "real" usage over thousands and thousands of runs, because "real" runs are not at all consistent. – Ben Voigt Aug 05 '14 at 12:59
27

I am able to reproduce this with GCC 4.8.2-19ubuntu1 with no optimization:

$ ./a.out 
4.780179
3.762356

Here is the empty loop:

0x00000000004005af <+50>:    addq   $0x1,-0x20(%rbp)
0x00000000004005b4 <+55>:    cmpq   $0x7fffffff,-0x20(%rbp)
0x00000000004005bc <+63>:    jb     0x4005af <main+50>

And here's the non-empty one:

0x000000000040061a <+157>:   mov    -0x24(%rbp),%eax
0x000000000040061d <+160>:   cltd   
0x000000000040061e <+161>:   shr    $0x1f,%edx
0x0000000000400621 <+164>:   add    %edx,%eax
0x0000000000400623 <+166>:   and    $0x1,%eax
0x0000000000400626 <+169>:   sub    %edx,%eax
0x0000000000400628 <+171>:   add    %eax,-0x28(%rbp)
0x000000000040062b <+174>:   addq   $0x1,-0x20(%rbp)
0x0000000000400630 <+179>:   cmpq   $0x7fffffff,-0x20(%rbp)
0x0000000000400638 <+187>:   jb     0x40061a <main+157>

Let's insert a nop in the empty loop:

0x00000000004005af <+50>:    nop
0x00000000004005b0 <+51>:    addq   $0x1,-0x20(%rbp)
0x00000000004005b5 <+56>:    cmpq   $0x7fffffff,-0x20(%rbp)
0x00000000004005bd <+64>:    jb     0x4005af <main+50>

They now run equally fast:

$ ./a.out 
3.846031
3.705035

I imagine this shows the importance of alignment, but I'm afraid I can't specifically say how :|

that other guy
  • Have you added warmup and averaging across multiple iterations? (The need is pointed out in my answer) – Ben Voigt Jul 31 '14 at 22:21
  • @BenVoigt Yes. It has no effect. – that other guy Jul 31 '14 at 22:38
  • A similar effect is described here: http://eli.thegreenplace.net/2013/12/03/intel-i7-loop-performance-anomaly – jmiserez Aug 01 '14 at 01:40
  • 3
    @PlasmaHH More than anything else it shows that an empty loop is slower than a loop with one empty instruction. Maybe you missed the point? – kritzikratzi Aug 01 '14 at 15:06
  • 1
    @kritzikratzi: maybe you are missing the point that it is meaningless to assess the speed of assembly *that is not generated to be fast* – PlasmaHH Aug 01 '14 at 15:15
  • 2
    @kritzikratzi: Pipeline scheduling is an optimizer function, just as much as hoisting independent subexpressions outside the loop. If you disable optimization to avoid loop unrolling, etc., you're also disabling pipeline scheduling. – Ben Voigt Aug 01 '14 at 15:55
@BenVoigt Oooh, if you mean the optimization flag to the compiler: getting an odd result without optimization flags doesn't imply that it goes away when you add the flag (it would be funny if logic worked that way). In fact, the link by jmiserez seems to say that -O2 and -O3 produce the same results. – kritzikratzi Aug 01 '14 at 17:42
  • 2
    In this case, -O2 and -O3 both optimize away the loops. With -O, both loops become `mov $0x7fffffff, %edx; L: sub $0x1,%rdx; jne L` and take 0.6s. – that other guy Aug 01 '14 at 17:50
  • 4
Why can't I downvote PlasmaHH's comment? This is about learning what happens inside a processor, and we have a case where a loop with three assembly instructions runs slower than one with the same instructions and a nop added, which runs at the same speed as a loop with seven more instructions added. If I were writing compilers, I would find this code extraordinarily interesting. I'm not writing compilers, and I still find it very, very interesting. – gnasher729 Aug 01 '14 at 18:17
If I were writing compilers, especially optimising compilers, I'd really want to understand what's happening here, and sometimes add a nop instruction to a loop to make it faster. – gnasher729 Aug 01 '14 at 18:23
  • 1
@gnasher729: Because the question is about *which C code is faster*. Comparing non-optimized translations is useless *toward that end*. Looking at which assembly code is faster would not be meaningless, but it would answer a different question. – Ben Voigt Aug 05 '14 at 13:02