> Surprisingly, I found that (b) runs ~10% faster than (a)
You probably tested on a CPU other than AMD Bulldozer-family or Ryzen, which have a fast `loop` instruction. On other CPUs, `loop` is very slow, mostly on purpose for historical reasons, so you bottleneck on it: e.g. 7 uops, one per 5 cycles throughput on Haswell.
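As a hedged illustration (the loop body here is a stand-in, not the question's actual code), the usual workaround is an explicit `dec`/`jnz` pair, which macro-fuses into a single uop on Intel Sandybridge-family CPUs:

```nasm
; `loop` is micro-coded (7 uops, one per 5 clocks on Haswell):
slow:
    add rax, rcx        ; stand-in loop body
    loop slow           ; dec rcx + branch if non-zero, but slow

; Equivalent with dec/jnz: macro-fuses into one uop on Intel SnB-family
fast:
    add rax, rcx        ; stand-in loop body
    dec rcx
    jnz fast
```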
`mov r64, imm64` is bad for uop-cache throughput because the large immediate takes 2 slots in Intel's uop cache. (See the Sandybridge uop cache section in Agner Fog's microarch PDF, and Which is faster, imm64 or m64 for x86-64? where I listed the details.)
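For a sense of scale, these are the encodings involved (the slot counts are Agner Fog's Sandybridge-family numbers; the constants are made up):

```nasm
mov rax, 0x1122334455667788  ; 10 bytes (REX.W B8+r imm64): 2 uop-cache slots
mov eax, 0x11223344          ; 5 bytes (imm32, zero-extends into RAX): 1 slot
mov rax, -1                  ; 7 bytes (sign-extended imm32 form): 1 slot
```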
Even apart from that, it's not too surprising that 1 extra uop in the loop makes it run slower. You're probably not on an AMD CPU (with single-uop `loop` at 1 per 2 clocks), because the extra `mov` in such a tiny loop would make more than a 10% difference there. Or maybe no difference at all, since it's just 3 vs. 4 uops per 2 clocks, if it's correct that even tiny `loop` loops are limited to one jump per 2 clocks.
On Intel, `loop` is 7 uops, one per 5 clocks throughput on most CPUs, so the 4-per-clock issue/rename bottleneck won't be what you're hitting. `loop` is micro-coded, so the front-end can't run from the loop buffer. (And Skylake CPUs have their LSD disabled by a microcode update to fix a partial-register erratum anyway.) So the `mov r64, imm64` uop has to be re-read from the uop cache every time through the loop.
A load that hits in cache has very good throughput (2 loads per clock, and in this case micro-fusion means no extra uops to use a memory operand instead of a register for `cmp`). So the main penalty of keeping a constant in memory is the extra cache footprint and cache misses, but your microbenchmark won't reveal that at all; it also has no other pressure on the load ports.
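A minimal sketch of what that looks like (the label `big_const` and its value are invented for illustration):

```nasm
default rel
section .rodata
big_const: dq 0x123456789ABCDEF0   ; hypothetical 64-bit constant

section .text
check:
    cmp rax, [big_const]   ; micro-fuses: 1 fused-domain uop, no extra
                           ; front-end cost vs. a register source
    jne .not_equal
    ret
.not_equal:
    ret
```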
In the general case:
If possible, use a RIP-relative `lea` to generate 64-bit address constants, e.g. `lea rax, [rel addr64]`. Yes, this takes an extra instruction to get the constant into a register. (BTW, just put `default rel` at the top of your file so every memory operand is RIP-relative by default; you can still use `[abs fs:0]` or similar if you do need absolute addressing.)
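Roughly like this in NASM (the symbol `addr64` is a placeholder):

```nasm
default rel                 ; make [symbol] RIP-relative by default

section .data
addr64: dq 0                ; placeholder static object

section .text
    lea rax, [addr64]       ; 7 bytes, position-independent: no fixup of
                            ; the code needed at load time
    mov rax, addr64         ; 10-byte imm64 alternative: needs a 64-bit
                            ; relocation, and 2 uop-cache slots
```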
You can avoid the extra instruction if you build position-dependent code with the default (small) code model, so static addresses fit in the low 32 bits of virtual address space and can be used as immediates. (Actually the low 2GiB, so sign- or zero-extending both work.) See 32-bit absolute addresses no longer allowed in x86-64 Linux? if gcc complains about absolute addressing; `-pie` is enabled by default on most distros. This of course doesn't work in Linux shared libraries, which only support text relocations for 64-bit addresses. But you should avoid relocations whenever possible by using `lea` to make position-independent code.
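In a position-dependent executable (e.g. linked with `-no-pie`; `buf` is a made-up symbol) that looks like:

```nasm
section .bss
buf: resb 64                ; hypothetical static buffer

section .text
    mov edi, buf            ; 5 bytes: imm32 address, zero-extends into RDI;
                            ; only legal when buf is in the low 2GiB (non-PIE)
```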
Most integer build-time constants fit in 32 bits, so you can use `cmp r64, imm32` or `cmp r32, imm32` even in PIC code.
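For example (constants invented):

```nasm
cmp rax, 100000             ; imm32 sign-extends to 64 bits: works for any
                            ; constant in [-2^31, 2^31 - 1]
cmp eax, 0x80000000         ; or compare the 32-bit register if the upper
                            ; half of the value doesn't matter
```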
If you do need a 64-bit non-address constant, try to hoist the `mov r64, imm64` out of a loop. Your `cmp` loop would have been fine if the `mov` weren't inside the loop. x86-64 has enough registers that you (or the compiler) can usually avoid reloads inside inner-most loops in integer code.
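A sketch of the hoisted version (the loop body is assumed, since the question's code isn't reproduced here):

```nasm
    mov rdx, 0x123456789ABCDEF0   ; imm64 load hoisted: paid once, not per iter
.loop:
    cmp [rdi], rdx                ; register source inside the loop: cheap
    jne .mismatch
    add rdi, 8                    ; assumed: walking an array of qwords
    dec rcx
    jnz .loop
.mismatch:
```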