
Is there a difference in performance between these mov load instructions? Do the more complex addressing modes have extra overhead (latency or throughput) compared to the simple ones?

# AT&T syntax                       # Intel syntax:
movq (%rsi), %rax                   mov  rax, [rsi]

movq (%rdi, %rsi), %rax             mov  rax, [rdi + rsi]

movq (%rdi, %rsi, 4), %rax          mov  rax, [rdi + rsi*4]
Peter Cordes
mosceo
  • Some microarchitectures might take an extra cycle of latency for address generation (this not posting this as an answer because IDK the answer to this part) with complex addressing modes, but throughput shouldn't be affected. Choice of addressing mode does matter on Intel SnB-family for instructions other than `mov`-loads, because [indexed addressing modes can't micro-fuse](http://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes/31027695#31027695). So `mov %rax, (%rdi, %rsi)` is 2 uops, like `add (%rdi,%rsi), %rax`, but `mov (%rdi,%rsi), %rax` is 1 uop. – Peter Cordes Aug 16 '16 at 05:24
  • Do you have any specific microarchitecture in mind? e.g. Intel SnB/Haswell family? AMD Bulldozer-family? Intel Atom? – Peter Cordes Aug 16 '16 at 07:36
  • @PeterCordes Did not think about any particular architecture. The regular x86-64 family. – mosceo Aug 16 '16 at 12:32
  • All three of those microarchitecture families implement the x86-64 ISA, but have different performance characteristics. A few things are the same across different microarchitectures, but many things are different. – Peter Cordes Aug 16 '16 at 17:26

2 Answers

4

Yes, there is an overhead for "complex addressing" on recent Intel CPUs. The cost is one additional cycle of latency (e.g., 5 cycles for a normal GP load using complex addressing versus 4 cycles with simple addressing).

Simple addressing is anything of the form [reg + offset] where the immediate offset is between 0 and 2047 inclusive.

Complex addressing is anything other than simple addressing.

In particular, any addressing mode with two registers, like your examples [rdi + rsi] or [rdi + rsi*4], is complex addressing and costs an extra cycle.
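To make the boundary concrete, here is a sketch in Intel/NASM syntax (my own examples; the 4-vs-5-cycle figures are the ones quoted above for recent Intel CPUs):

```nasm
; Simple addressing: one base register plus a displacement of 0..2047
mov rax, [rsi]          ; simple: ~4-cycle load-use latency
mov rax, [rsi + 128]    ; still simple: displacement is in 0..2047

; Complex addressing: anything else, costing ~5 cycles instead
mov rax, [rdi + rsi]    ; complex: two registers
mov rax, [rdi + rsi*4]  ; complex: scaled index
mov rax, [rsi + 4096]   ; complex: displacement is 2048 or more
```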

There is an exceptional case: if the index register1 is zeroed via a zeroing idiom (like xor edi, edi, but not like mov edi, 0) you don't pay the complex addressing penalty.
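A sketch of that exception, using the index register of the examples above (my own illustration of the rule stated in this answer):

```nasm
xor esi, esi            ; recognized zeroing idiom: the CPU knows rsi == 0
mov rax, [rdi + rsi*4]  ; index is known-zero, so no complex-addressing penalty

mov esi, 0              ; NOT a recognized zeroing idiom, even though rsi == 0
mov rax, [rdi + rsi*4]  ; still pays the extra cycle for complex addressing
```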


1 The index register is the one multiplied by 1, 2, 4 or 8, i.e., rsi in [rdi + rsi*4]. In the case where neither register has an explicit multiplier, like [rdi + rsi], the multiplier is 1 and you'll have to check your assembler to see how to specify which register is the index and which is the base. nasm seems to use the second register as the index.
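For example, in nasm (per the footnote's observation; an explicit scale removes the ambiguity):

```nasm
mov rax, [rdi + rsi]    ; nasm appears to pick rsi (the second register) as the index
mov rax, [rsi*1 + rdi]  ; explicit scale forces rsi to be the index, rdi the base
```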

BeeOnRope
    Does that also avoid un-lamination for micro-fused loads? [Micro fusion and addressing modes](https://stackoverflow.com/q/26046634). Does that still apply for `[zerored_reg + rsi*4]`? (You could plausibly use that to stride through an input array 4x faster than an output array, but it's a stretch.) – Peter Cordes Aug 03 '18 at 05:43
  • This reminds me, I've been meaning to ask on SO how to construct a test-case that's sensitive to store-address latency, so we could measure whether `[rdi + 16]` is faster than `[rdi + 4096]` as a store address. Intel's optimization manual only mentions lower latency for loads with simple addressing modes. Last time I tried, I ran into trouble trying to create a dependency chain through a store/reload with extra `imul reg,1` for the address. – Peter Cordes Aug 03 '18 at 05:47
  • @PeterCordes - it's only the displacement as I expected, although the original answer text seemed to indicate otherwise. I fixed it. – BeeOnRope Aug 03 '18 at 18:53
  • @PeterCordes - well one could take the difficulty of measuring store latency as an indication that it's not likely to be very interesting in real code, since it's so hard to create code that relies on it (unlike load latency which trivially shows up all the time). To test store latency you are by definition relying on SLF, which we know works "weirdly" on modern Intel chips: exhibiting a longer latency if you try to use the result too soon, and a shorter latency if you use it at just the right time. I have some tests in uarch-bench (`--test-name=memory/store-fwd/*`) that show this. – BeeOnRope Aug 03 '18 at 19:00
  • @PeterCordes - I went ahead and modified that test: I didn't see any difference with the store forwarding pattern with an offset or not. You could still store forward as fast as 3 cycles if the load executes in exactly 3 cycles after the store. Note that in this test the dependency is purely through the load and store data, i.e., `mov rcx, [rdx + offset]; [rdx + offset], rcx`, so the address calculation isn't in the dependency chain. Maybe that's not what you want: maybe you need the address to be in the dep chain to see any extra latency. – BeeOnRope Aug 03 '18 at 19:10
  • Related: [Is there a penalty when base+offset is in a different page than the base?](https://stackoverflow.com/q/52351397) has more details about when that 0..2047 special case applies for load-use latency, and that it's speculative with a downside. – Peter Cordes Sep 16 '20 at 12:45
1

It depends on which specific CPU; but mostly "no, there's no extra overhead". However...

Most CPUs have out-of-order cores, which means they execute instructions in whatever order is fastest, not in the order the instructions are given. For this to work, an instruction (e.g. movq (%rdi, %rsi, 4), %rax) can't execute until the things it depends on are ready (e.g. the values in rdi and rsi are known).

For example, these 2 instructions can occur in parallel (because the second instruction doesn't depend on the first):

movq (%rdi), %rdi
movq (%rsi), %rax

And these 2 instructions can't occur in parallel (the second instruction uses rdi in its address, so it has to wait until the first instruction completes):

movq (%rdi), %rdi
movq (%rdi, %rsi), %rax

Also note that the bottleneck for a piece of code may not be execution. If the bottleneck is instruction fetch then larger instructions will be worse; if the bottleneck is instruction decode then more complex instructions can be worse; if the bottleneck is data cache bandwidth then anything that reads/writes to memory can be worse, etc.
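As one concrete illustration of the fetch-bandwidth point: more complex addressing modes need more encoding bytes. The byte counts below are my own (verify with your assembler), shown in Intel syntax:

```nasm
mov rax, [rsi]          ; 3 bytes (REX.W + opcode + ModRM)
mov rax, [rdi + rsi*4]  ; 4 bytes (adds a SIB byte)
mov rax, [rsi + 4096]   ; 7 bytes (adds a 4-byte displacement)
```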

Basically, you can't look at individual instructions in isolation and decide if they're better/worse. You have to look at entire sequences of multiple instructions so that you can know about any dependencies on previous instructions (and their latencies); you have to know what the bottleneck is (e.g. from performance monitoring tools); and if you know all this then you can make an "educated guess" that's only really useful for a small number of CPUs (because different CPUs have different characteristics).

Brendan
  • You *can* compare the throughput and latency of single instructions, assuming all their inputs are ready. e.g. `lea (%rax, %rax), %ecx` has 1c latency and 2 per cycle throughput (Intel Haswell, using p1/p5), while `imul $2, %eax, %ecx` has 3c latency and 1 per cycle throughput (Haswell, p1 only). Both do the same thing, but you will always choose the `lea` because it's better in every way. (Agner Fog's tables say 3-component addresses for `lea` have 3c latency and less throughput, running on only p1. I used multiply by 2 as an example so code-size isn't an issue either.) – Peter Cordes Aug 16 '16 at 05:52
  • It is a fair point that using a different addressing mode means changing the surrounding code. Using an index as a loop variable instead of a pointer increment means you can count up towards zero, removing 1 uop from the loop. (base register = one past the end of the array, so you can exit the loop with `inc`/`jz` or something.) – Peter Cordes Aug 16 '16 at 05:56