> Surprisingly, I found that (b) runs ~10% faster than (a)
You probably tested on a CPU other than AMD Bulldozer-family or Ryzen, which have a fast `loop` instruction. On other CPUs, `loop` is very slow, mostly on purpose for historical reasons, so you bottleneck on it: e.g. 7 uops, one per 5 cycles throughput on Haswell.
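As a hedged illustration (the loop body here is a stand-in, not the question's actual code), the usual workaround is an explicit `dec`/`jnz` pair, which macro-fuses into a single uop on Intel Sandybridge-family CPUs:

```nasm
; `loop` is micro-coded (7 uops, one per 5 clocks on Haswell):
slow:
    add rax, rcx        ; stand-in loop body
    loop slow           ; dec rcx + branch if non-zero, but slow

; Equivalent with dec/jnz: macro-fuses into one uop on Intel SnB-family
fast:
    add rax, rcx        ; stand-in loop body
    dec rcx
    jnz fast
```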
`mov r64, imm64` is bad for uop-cache throughput because the large immediate takes 2 slots in Intel's uop cache. (See the Sandybridge uop cache section in Agner Fog's microarch PDF, and Which is faster, imm64 or m64 for x86-64? where I listed the details.)
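For a sense of scale, these are the encodings involved (the slot counts are Agner Fog's Sandybridge-family numbers; the constants are made up):

```nasm
mov rax, 0x1122334455667788  ; 10 bytes (REX.W B8+r imm64): 2 uop-cache slots
mov eax, 0x11223344          ; 5 bytes (imm32, zero-extends into RAX): 1 slot
mov rax, -1                  ; 7 bytes (sign-extended imm32 form): 1 slot
```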
Even apart from that, it's not too surprising that 1 extra uop in the loop makes it run slower. You're probably not on an AMD CPU (with single-uop `loop` at 1 per 2 clocks), because the extra `mov` in such a tiny loop would make more than a 10% difference there. Or maybe no difference at all, since it's just 3 vs. 4 uops per 2 clocks, if it's correct that even tiny `loop` loops are limited to one jump per 2 clocks.
On Intel, `loop` is 7 uops, one per 5 clocks throughput on most CPUs, so the 4-per-clock issue/rename bottleneck won't be what you're hitting. `loop` is micro-coded, so the front-end can't run from the loop buffer. (And Skylake CPUs have their LSD disabled by a microcode update to fix a partial-register erratum anyway.) So the `mov r64, imm64` uop has to be re-read from the uop cache every time through the loop.
A load that hits in cache has very good throughput (2 loads per clock, and in this case micro-fusion means no extra uops to use a memory operand instead of a register for `cmp`). So the main penalty of keeping a constant in memory is the extra cache footprint and cache misses, but your microbenchmark won't reveal that at all; it also has no other pressure on the load ports.
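A minimal sketch of what that looks like (the label `big_const` and its value are invented for illustration):

```nasm
default rel
section .rodata
big_const: dq 0x123456789ABCDEF0   ; hypothetical 64-bit constant

section .text
check:
    cmp rax, [big_const]   ; micro-fuses: 1 fused-domain uop, no extra
                           ; front-end cost vs. a register source
    jne .not_equal
    ret
.not_equal:
    ret
```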
In the general case:
If possible, use a RIP-relative `lea` to generate 64-bit address constants, e.g. `lea rax, [rel addr64]`. Yes, this takes an extra instruction to get the constant into a register. (BTW, just put `default rel` at the top of your file so every memory operand is RIP-relative by default; you can still use `[abs fs:0]` or similar if you do need absolute addressing.)
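Roughly like this in NASM (the symbol `addr64` is a placeholder):

```nasm
default rel                 ; make [symbol] RIP-relative by default

section .data
addr64: dq 0                ; placeholder static object

section .text
    lea rax, [addr64]       ; 7 bytes, position-independent: no fixup of
                            ; the code needed at load time
    mov rax, addr64         ; 10-byte imm64 alternative: needs a 64-bit
                            ; relocation, and 2 uop-cache slots
```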
You can avoid the extra instruction if you build position-dependent code with the default (small) code model, so static addresses fit in the low 32 bits of virtual address space and can be used as immediates. (Actually the low 2GiB, so sign- or zero-extending both work.) See 32-bit absolute addresses no longer allowed in x86-64 Linux? if gcc complains about absolute addressing; `-pie` is enabled by default on most distros. This of course doesn't work in Linux shared libraries, which only support text relocations for 64-bit addresses. But you should avoid relocations whenever possible by using `lea` to make position-independent code.
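In a position-dependent executable (e.g. linked with `-no-pie`; `buf` is a made-up symbol) that looks like:

```nasm
section .bss
buf: resb 64                ; hypothetical static buffer

section .text
    mov edi, buf            ; 5 bytes: imm32 address, zero-extends into RDI;
                            ; only legal when buf is in the low 2GiB (non-PIE)
```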
Most integer build-time constants fit in 32 bits, so you can use `cmp r64, imm32` or `cmp r32, imm32` even in PIC code.
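For example (constants invented):

```nasm
cmp rax, 100000             ; imm32 sign-extends to 64 bits: works for any
                            ; constant in [-2^31, 2^31 - 1]
cmp eax, 0x80000000         ; or compare the 32-bit register if the upper
                            ; half of the value doesn't matter
```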
If you do need a 64-bit non-address constant, try to hoist the `mov r64, imm64` out of a loop. Your `cmp` loop would have been fine if the `mov` weren't inside the loop. x86-64 has enough registers that you (or the compiler) can usually avoid reloads inside inner-most loops in integer code.
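A sketch of the hoisted version (the loop body is assumed, since the question's code isn't reproduced here):

```nasm
    mov rdx, 0x123456789ABCDEF0   ; imm64 load hoisted: paid once, not per iter
.loop:
    cmp [rdi], rdx                ; register source inside the loop: cheap
    jne .mismatch
    add rdi, 8                    ; assumed: walking an array of qwords
    dec rcx
    jnz .loop
.mismatch:
```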