
Apparently, modern processors can tell if you do something stupid like moving a register to itself (mov %eax, %eax) and optimize that out. Trying to verify that claim, I ran the following program:

#include <stdio.h>
#include <time.h>

static inline void f1() {
   for (int i = 0; i < 100000000; i++)
      __asm__(
            "mov %eax, %eax;"
            "nop;"
            );
}

static inline void f2() {
   for (int i = 0; i < 100000000; i++)
      __asm__(
            "nop;"
            );
}

static inline void f3() {
   for (int i = 0; i < 100000000; i++)
      __asm__(
            "mov %ebx, %eax;"
            "nop;"
            );
}

int main() {
   int NRUNS = 10;
   clock_t t, t1, t2, t3;

   t1 = t2 = t3 = 0;
   for (int run = 0; run < NRUNS; run++) {
      t = clock(); f1(); t1 += clock()-t;
      t = clock(); f2(); t2 += clock()-t;
      t = clock(); f3(); t3 += clock()-t;
   }

   printf("f1() took %f cycles on avg\n", (float) t1/ (float) NRUNS);
   printf("f2() took %f cycles on avg\n", (float) t2/ (float) NRUNS);
   printf("f3() took %f cycles on avg\n", (float) t3/ (float) NRUNS);

   return 0;
}

This gives me:

f1() took 175587.093750 cycles on avg
f2() took 188313.906250 cycles on avg
f3() took 194654.296875 cycles on avg

As one would expect, f3() comes out slowest. But surprisingly (to me at least), f1() is faster than f2(). Why is that?

Update: Compiling with -falign-loops gives qualitatively the same result:

f1() took 164271.000000 cycles on avg
f2() took 173783.296875 cycles on avg
f3() took 177765.203125 cycles on avg
  • That's weird for sure. Which processor have you tested this on? Which compiler did you use, with what flags? Note further that `clock()` doesn't count cycles but rather ticks of `CLOCKS_PER_SEC` per second, so printing `%f cycles` is misleading (see the timing sketch after these comments). Also, try not to cast to `float` when you could use `double`. – fuz Oct 25 '18 at 15:15
  • I used gcc without any flags: `gcc test.c -o test`. The CPU is Intel(R) Core(TM) i5-4670S CPU @ 3.10GHz. – BlenderBender Oct 25 '18 at 15:17
  • Please try again with optimisations turned on, as without optimisations weird things might happen. Also try to change your code so each example runs for at least a second, to avoid warm-up or similar issues. Are you compiling for 32-bit mode or for 64-bit mode? – fuz Oct 25 '18 at 15:18
  • The test you are measuring is mostly loop overhead. For such a tiny loop body, other factors may dominate, such as jump target alignment. – Raymond Chen Oct 25 '18 at 15:21
  • Yeah, I think loop overhead is creating unreliable results too. – Afshin Oct 25 '18 at 15:25
  • @RaymondChen: What do you mean by "jump target misalignment"? – BlenderBender Oct 25 '18 at 15:26
  • @BlenderBender If a jump doesn't go to the beginning of a cache line, further cache lines have to be loaded earlier than normal, leading to a potential stall while loading the next cache line. – fuz Oct 25 '18 at 15:32
  • @fuz But wouldn't you expect this to be a random phenomenon? I mean, I get the same timings consistently. – BlenderBender Oct 25 '18 at 15:43
  • @BlenderBender The alignment is decided at assembly time, so if it affects the measurements, it should be consistent. Try to compile with optimizations or at least `-falign-loops` to eliminate the possibility of this happening. – fuz Oct 25 '18 at 15:47
  • @fuz Updated my question. Even with loop alignment, `f1()` comes out significantly faster. – BlenderBender Oct 25 '18 at 15:49
  • Not an explanation, but I notice that in your code sample you don't actually do `mov eax, eax` as you claim, but rather `mov ebx, eax`. – 500 - Internal Server Error Oct 25 '18 at 15:55
  • @BlenderBender On my computer (Intel(R) Xeon(R) W-2133 CPU @ 3.60GHz) I get 142678.703125/142067.000000/143752.093750, without optimisations but with `-falign-loops`. With optimisations the code doesn't terminate because you overwrite `eax`. – fuz Oct 25 '18 at 15:59
  • Can you quote the part of the link that gave you the incorrect impression that `mov %eax, %eax;` or `nop` could be totally free? They both cost front-end throughput, and `mov %eax,%eax` isn't even a NOP in 64-bit mode (it zero-extends EAX into RAX, and because it's the same register mov-elimination fails on Intel CPUs.) See [Can x86's MOV really be "free"? Why can't I reproduce this at all?](https://stackoverflow.com/q/44169342) for more about front-end / back-end throughput bottlenecks vs. latency. – Peter Cordes Oct 26 '18 at 00:58
  • Adding independent instructions off the critical path dependency chain can make no difference to run-time, though, if your loop bottlenecks on latency. [What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?](https://stackoverflow.com/q/51607391). It can hurt when hyperthreading has 2 threads competing for one core, though, and takes up space in the out-of-order execution window so the CPU can't see as far ahead. – Peter Cordes Oct 26 '18 at 01:00
  • Un-optimized code can bottleneck on weird stuff, so yeah, alignment could matter. [Adding a redundant assignment speeds up code when compiled without optimization](https://stackoverflow.com/q/49189685) and [Loop with function call faster than an empty loop](https://stackoverflow.com/a/45529487). I wonder if `mov %eax,%eax` is somehow interacting with the store-forwarding dependency chain for the loop counter here, and triggering the surprising Sandybridge-family effect where not attempting to reload right away can give lower store-forwarding latency. – Peter Cordes Oct 26 '18 at 01:02
  • Only 100 million iterations may not really be long enough for a good test, though, especially with no warm-up first to get the CPU up to max clock speed. – Peter Cordes Oct 26 '18 at 01:06
  • @PeterCordes The part of the linked article that made me think that this can be optimized away is: "the move function takes care of checking for equivalent locations". – BlenderBender Oct 26 '18 at 11:33
  • @BlenderBender: that's talking about the `(move r x)` *function* in SBCL, not the x86 `mov` *instruction*. It's talking about optimization during code generation from that low-level intermediate language, not at runtime by the hardware. – Peter Cordes Oct 26 '18 at 11:37
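
Pulling together fuz's point about clock() units and Peter Cordes' warm-up advice, here is a minimal sketch of a reworked timing harness. The helper name time_fn and the seconds-based reporting are my own illustration, not from the thread:

#include <stdio.h>
#include <time.h>

/* Average the runtime of fn() over nruns, reported in seconds rather
 * than raw clock() ticks. The initial untimed call is a warm-up so the
 * CPU can ramp up to full clock speed before measurement starts. */
static double time_fn(void (*fn)(void), int nruns) {
   fn();                            /* warm-up run, not timed */
   clock_t total = 0;
   for (int run = 0; run < nruns; run++) {
      clock_t t = clock();
      fn();
      total += clock() - t;
   }
   return (double)total / nruns / CLOCKS_PER_SEC;
}

With the f1/f2/f3 from the question, usage would be e.g. printf("f1() took %f seconds on avg\n", time_fn(f1, 10));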

1 Answer


"The part of the linked article that made me think that this can be optimized away is: 'the move function takes care of checking for equivalent locations'"

That's talking about the (move r x) function in SBCL, not the x86 mov instruction. It describes an optimization performed during code generation from that low-level intermediate language, not something the hardware does at runtime.

Neither mov %eax, %eax nor nop is totally free. Both cost front-end throughput, and mov %eax,%eax isn't even a NOP in 64-bit mode: it zero-extends EAX into RAX, and because source and destination are the same register, mov-elimination fails on Intel CPUs.

See Can x86's MOV really be "free"? Why can't I reproduce this at all? for more about front-end / back-end throughput bottlenecks vs. latency.
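
To see the zero-extension point concretely, here is a minimal runnable demo (my own illustration, not part of the original answer); the "+a" constraint pins x to RAX:

#include <stdint.h>
#include <stdio.h>

int main(void) {
   uint64_t x = 0xdeadbeefcafebabeULL;        /* upper 32 bits non-zero */
   /* In 64-bit mode, any write to EAX zero-extends into RAX, so this
    * "self-move" is not architecturally a no-op: it clears the top half. */
   __asm__("mov %%eax, %%eax" : "+a"(x));
   printf("%#llx\n", (unsigned long long)x);  /* prints 0xcafebabe */
   return 0;
}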


You're probably seeing some side effect of code alignment, or maybe a funky Sandybridge-family store-forwarding latency effect like the one in Adding a redundant assignment speeds up code when compiled without optimization. Because you compiled with optimization disabled, the compiler makes anti-optimized code for consistent debugging and keeps the loop counter in memory, so each iteration carries a ~6-cycle loop-carried dependency chain through the store/reload of the counter, instead of running at 1 iteration per clock like a normal tiny loop.
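
For illustration, here is a sketch of the kind of loop gcc emits at -O0 (representative of typical output, not the exact assembly for this program). The counter lives on the stack, so each iteration's increment is a load-add-store, and the compare reloads it again:

        movl    $0, -4(%rbp)            # i lives in memory at -O0
        jmp     .L2
.L3:
        mov %eax, %eax                  # the inline-asm template from f1
        nop
        addl    $1, -4(%rbp)            # i++: load, add, store back to memory
.L2:
        cmpl    $99999999, -4(%rbp)     # reload i to test the loop condition
        jle     .L3

The loop-carried dependency runs through that store/reload of i, which is why store-forwarding latency, not the instructions in the loop body, sets the pace.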

If your results are reproducible with a larger iteration count, there's presumably some microarchitectural explanation for what you're seeing, but it's probably not related to anything you were trying to measure.

Of course, you'd also need to fix the mov %ebx, %eax bug in f3 before the code could work correctly with optimization enabled: clobbering EAX without telling the compiler will step on compiler-generated code (as fuz noted in the comments, the optimized build never terminates for exactly this reason). You didn't explain what you were trying to test with that, so I don't know whether it was a typo.
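
A minimal sketch of one possible fix (mine, not from the original answer): switch to extended asm and declare the clobber, so the compiler keeps its own values out of EAX. EBX is only read here, so it needs no clobber, though its value is arbitrary.

static inline void f3(void) {
   for (int i = 0; i < 100000000; i++)
      __asm__ volatile(
            "mov %%ebx, %%eax;"         /* %% escapes registers in extended asm */
            "nop;"
            : /* no outputs */
            : /* no inputs */
            : "eax");                   /* EAX is overwritten; tell the compiler */
}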

– Peter Cordes