The fact is that modern processors are complicated. All the instructions being executed interact with each other in complicated and interesting ways. Thanks to "that other guy" for posting the code.
Both OP and "that other guy" apparently found that the short loop takes 11 cycles, while the long one takes 9 cycles. For the long loop, 9 cycles is plenty of time even though there are a lot of operations. For the short loop, there must be some stall caused by it being so short, and just adding a nop makes the loop long enough to avoid the stall.
One thing that happens if we look at the code:
0x00000000004005af <+50>: addq $0x1,-0x20(%rbp)
0x00000000004005b4 <+55>: cmpq $0x7fffffff,-0x20(%rbp)
0x00000000004005bc <+63>: jb 0x4005af <main+50>
We read i and write it back (addq). We read it again immediately and compare it (cmpq). And then we loop. But the loop uses branch prediction. So at the time when the addq is executed, the processor isn't really sure it is allowed to write to i (because branch prediction could be wrong).
Then we compare to i. The processor will try to avoid reading i from memory, because reading it takes a long time. Instead, some bit of hardware will remember that we just wrote to i by adding to it, and instead of reading i, the cmpq instruction gets the data from the store instruction. Unfortunately, we are not sure at this point if the write to i actually happened or not! So that could introduce a stall here.
The problem here is that the conditional jump, the addq which leads to a conditional store, and the cmpq which isn't sure where to get its data from, are all very, very close together. They are unusually close together. It could be that they are so close together that the processor cannot figure out at this time whether to take i from the store instruction or to read it from memory, so it reads it from memory, which is slower because it has to wait for the store to finish. And adding just one nop gives the processor enough time.
Usually you think that there is RAM, and there is cache. On a modern Intel processor, reading memory can read from (slowest to fastest):
- Memory (RAM)
- L3 cache (optional)
- L2 cache
- L1 cache
- Previous store instruction that hasn't written to L1 cache yet.
So what the processor does internally in the short, slow loop:
1. Read i from L1 cache
2. Add 1 to i
3. Write i to L1 cache
4. Wait until i is written to L1 cache
5. Read i from L1 cache
6. Compare i with INT_MAX
7. Branch to (1) if it is less.
In the long, fast loop, the processor does:
1. Lots of stuff
2. Read i from L1 cache
3. Add 1 to i
4. Do a "store" instruction that will write i to L1 cache
5. Read i directly from the "store" instruction without touching L1 cache
6. Compare i with INT_MAX
7. Branch to (1) if it is less.