Since you compiled in debug mode, your loop probably stores/reloads `k` to memory every iteration (and the same for the loop counter). Tiny loops in un-optimized code usually bottleneck on store-forwarding latency (~5 cycles on Intel Haswell), not on any kind of throughput limit.
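To make that concrete, here's a minimal sketch of the kind of loop where a debug build bottlenecks on that store-forwarding round trip. The bound and body are placeholders, not your actual code:

```cpp
// Minimal sketch, not the code from the question: placeholder bound and body.
// Compiled without optimization (MSVC /Od, gcc/clang -O0), every access to k and
// i goes through a stack slot: load, add, store.  The next iteration's reload of
// the same slot has to wait for store-forwarding from the previous store
// (~5 cycles on Haswell), so latency, not ALU throughput, limits the loop.
long long debug_mode_bottleneck(long long n) {
    long long k = 0;
    for (long long i = 0; i < n; ++i) {
        k += 1;   // unoptimized: reload k from the stack, add, store it back
        k += 2;   // another full memory round trip chained onto the same slot
    }
    return k;
}
```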
> My guess is that CPU is performing some kind of heuristics during the branch prediction and that in my specific case my specific CPU was better at predicting this branch in test1.
That doesn't make sense. Perfect branch prediction can reduce the cost of a branch to zero, not make it negative. I.e. doing two ADD instructions + a branch can't be faster than just doing two ADD insns, unless there's something else different as well.
Presumably MSVC in debug mode ends up making less-bad code for test1 than for test2. Perhaps it keeps `k` in a register across one of the increments, instead of doing both increments with a memory destination?
If you posted the asm, it would be possible to tell you what the compiler did differently that made the first version run faster, but it wouldn't be useful or interesting. (And it has nothing to do with branch prediction; the branch should predict with at least 99.999% accuracy, so the actual branch instruction is nearly free.)
Read Agner Fog's microarch pdf if you want to learn how CPUs work internally, and how to predict how fast a loop of assembly instructions will run.
See my answer to another question, about optimizing for debug mode, for why it's meaningless and a waste of time, and not useful for learning what's fast and what's slow.
Optimization doesn't just speed everything up by a constant factor; it changes what kind of bottlenecks you will experience. (E.g. `k` will live in a register, so `k += 1` only has the latency of an ADD instruction, not of a round trip through memory.) And of course combining the increments into a `k += 3`, proving that the `i != 0` branch only goes the other way once, and peeling that iteration out.
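As a source-level sketch of those transformations (this assumes a test1 body along the lines of `if (i != 0) k += 2; k += 1;`, which is my reconstruction from the details in this answer, not the question's exact code):

```cpp
// Hypothetical "as written" version, guessed from the k+=3 / i!=0 description:
long long test1_as_written(long long n) {
    long long k = 0;
    for (long long i = 0; i < n; ++i) {
        if (i != 0) k += 2;   // taken on every iteration except the first
        k += 1;
    }
    return k;
}

// What the optimizer can effectively turn it into: k lives in a register, the
// two increments fuse into k += 3, and the i == 0 iteration is peeled off.
long long test1_peeled(long long n) {
    if (n <= 0) return 0;
    long long k = 1;          // peeled i == 0 iteration: only the += 1 ran
    for (long long i = 1; i < n; ++i)
        k += 3;
    return k;                 // == 3*n - 2 for n >= 1; a constant if n is one
}
```

If that reconstruction is right, the `movabs` constant mentioned below is just 3*n - 2 for the question's loop bound.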
More importantly, the result is a compile-time constant, so ideally both functions will compile to a single instruction that puts the result into the return-value register. Unfortunately, gcc6.2, clang3.9, and icc17 all fail to do that for version 1, so it looks like manually peeling out the first iteration is very helpful.
But for test2, all compilers have an easy time compiling it to `movabs rax, 29999999998` / `ret`.
Related: What optimizing compilers do with these functions
If you want to look at compiler output, use function args instead of constants. You did already return the result, so that's good.
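For example (hypothetical signatures and bodies, just to show the pattern):

```cpp
// With a hard-coded trip count, everything is visible at compile time, so a good
// compiler can fold the whole function down to "movabs rax, <constant> / ret"
// and there's nothing left to study:
long long test2_const_bound() {
    long long k = 0;
    for (long long i = 0; i < 1000000000LL; ++i)   // placeholder bound
        k += 3;
    return k;
}

// With the bound as a function argument, the loop (or at least a closed-form
// 3*n) has to survive into the generated code, which is what you want to look at:
long long test2_arg_bound(long long n) {
    long long k = 0;
    for (long long i = 0; i < n; ++i)
        k += 3;
    return k;
}
```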
I put your functions up on the Godbolt Compiler explorer (which recently added a newer version of Intel's compiler than icc13).
clang-3.9 foolishly uses `cmov` for test1, from -O1 to -O3. A branch that predictable would be better done with a jump, if it's not peeled entirely. (Perhaps an unsigned loop counter / upper bound would help the compiler figure out what's going on, but I didn't try.)
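A sketch of that untested idea, assuming test1 looks roughly like the reconstruction above:

```cpp
// Untested: with an unsigned counter, i != 0 is equivalent to i >= 1, which might
// make it more obvious to the compiler that the branch goes one way only on the
// first iteration, so it can peel that iteration instead of using cmov in the loop.
long long test1_unsigned(unsigned long long n) {
    long long k = 0;
    for (unsigned long long i = 0; i < n; ++i) {
        if (i != 0) k += 2;
        k += 1;
    }
    return k;
}
```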
gcc uses a conditional branch for test1, and a function-arg counter version of test2. The branching looks reasonable, with the `i != 0` case being the fall-through not-taken side of the CMP/JE. But it still actually counts the loop counter up to the max, as well as incrementing `k`.
In test2 gcc just runs up the loop counter and multiplies it outside the loop.
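A rough source-level paraphrase of that strategy (my reading of the asm, not gcc's actual output, assuming a branch-free body that adds 3 per iteration as sketched earlier):

```cpp
// Paraphrase of gcc's approach for the function-arg test2: the loop still runs to
// count iterations, but the repeated adds become one multiply after the loop.
long long test2_gcc_paraphrase(long long n) {
    long long i = 0;
    while (i < n)
        ++i;            // just run the counter up to n
    return i * 3;       // scale once, outside the loop
}
```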
ICC17 unrolls test1 pretty aggressively, with multiple integer registers as accumulators so more than one ADD or INC can happen in parallel. IDK if it helps, because it spills one of its accumulators, with a memory-destination ADD. Actually I think it's unrolled enough that the ~20-uop loop is only slowed down from 5 to 6 cycles per iteration by the bottleneck on a memory-destination ADD (on Intel Haswell), instead of the bottleneck on integer ALU ADD/INC instructions (4 per clock on Haswell). It does the `+=2` and `+=1` in different registers, and combines them at the end (I assume; I just looked at the loop with the `cmp rdx, 1249999999` end condition). It's amusing how aggressively it manages to optimize the loop without just optimizing it down to a constant like it does for test2.
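For reference, here's the multiple-accumulator idea in source form. This is a paraphrase of the technique, not ICC's actual output, and it uses a simplified branch-free body:

```cpp
// Simplified illustration of multiple-accumulator unrolling: four independent add
// chains can execute in parallel (up to 4 scalar ADDs per clock on Haswell),
// instead of one serial dependency chain through a single k.
long long sum3_unrolled(long long n) {
    long long k0 = 0, k1 = 0, k2 = 0, k3 = 0;
    long long i = 0;
    for (; i + 3 < n; i += 4) {
        k0 += 3;
        k1 += 3;
        k2 += 3;
        k3 += 3;
    }
    long long k = k0 + k1 + k2 + k3;   // combine the accumulators once at the end
    for (; i < n; ++i)
        k += 3;                        // handle the leftover iterations
    return k;
}
```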