
Consider the following code:

void foo(int* __restrict__ a)
{
    int i; int val = 0;
    for (i = 0; i < 100; i++) {
        val = 2 * i;
        a[i] = val;
    }
}

This compiles (at maximum optimization, but with no unrolling or vectorization) into...

GCC 7.2:

foo(int*):
        xor     eax, eax
.L2:
        mov     DWORD PTR [rdi], eax
        add     eax, 2
        add     rdi, 4
        cmp     eax, 200
        jne     .L2
        rep ret

clang 5.0:

foo(int*): # @foo(int*)
  xor eax, eax
.LBB0_1: # =>This Inner Loop Header: Depth=1
  mov dword ptr [rdi + 2*rax], eax
  add rax, 2
  cmp rax, 200
  jne .LBB0_1
  ret

What are the pros and cons of GCC's vs clang's approach? i.e. an extra variable incremented separately, vs multiplying via a more complex addressing mode?

Notes:

  • This question also relates to this one, which has roughly the same code but with floats rather than ints.
  • @harold: Ok, fixed that. – einpoklum Jan 20 '18 at 10:06
    It's not a multiplication of course, but a shift. Used to be performed for free by the dedicated AGU. I would be surprised if newer CPUs have made the performance worse. – Bo Persson Jan 20 '18 at 11:25
  • @BoPersson: So the scale factor can only be a power-of-2? Also, if you're saying that, for this reason, it's better to do the loop this way, please say so in an answer. – einpoklum Jan 20 '18 at 12:03
  • The scale factor can only be 2, 4, or 8 and was designed specifically for array indexing in languages like C. My experience goes back to when there was a separate address generation unit and these calculations were done "on the side", for free. I don't know enough details about the current CPUs, but would trust Peter Cordes' answer. :-) – Bo Persson Jan 20 '18 at 12:25
    The scale factor can also be 1 or 0, though 0 mostly acts differently. 1 is still a scale. – harold Jan 20 '18 at 13:08
  • @harold: It's a 2-bit field in the SIB byte. The *shift-count* can be 0..3, i.e. the multiplier can be 1, 2, 4, or 8. Shift-count = 0 is different from 1..2 in a few cases on a few CPUs (like LEA on AMD). But encoding "there's no index" (even when there is a SIB byte) is done differently: index=RSP actually means no index, which makes `[rsp]` possible (without needing a disp32) but `[rsp + rsp]` impossible. I'm not sure if you meant that "no index" was special, or that shift-count of 0 was special. (Because they both are, but no-index is more special more of the time). – Peter Cordes Jan 21 '18 at 03:57

2 Answers


Yes, take advantage of the power of x86 addressing modes to save uops, in cases where an index doesn't unlaminate into more extra uops than it would cost to do pointer increments.

(In many cases unrolling and using pointer increments is a win because of unlamination on Intel Sandybridge-family, but if you're not unrolling or if you're only using mov loads instead of folding memory operands into ALU ops for micro-fusion, then indexed addressing modes are often break even on some CPUs and a win on others.)

It's essential to read and understand Micro fusion and addressing modes if you want to make optimal choices here. (And note that IACA gets it wrong, and doesn't simulate Haswell and later keeping some uops micro-fused, so you can't even just check your work by having it do static analysis for you.)


Indexed addressing modes are generally cheap. At worst they cost one extra uop for the front-end (on Intel SnB-family CPUs in some situations), and/or prevent a store-address uop from using port7 (which only supports base + displacement addressing modes). See Agner Fog's microarch pdf, and also David Kanter's Haswell write-up, for more about the store-AGU on port7 which Intel added in Haswell.
On Haswell+, if you need your loop to sustain more than 2 memory ops per clock, then avoid indexed stores.

At best they're free other than the code-size cost of the extra byte in the machine-code encoding. (Having an index register requires a SIB (Scale Index Base) byte in the encoding).

More often the only penalty is the 1 extra cycle of load-use latency vs. a simple [base + 0-2047] addressing mode, on Intel Sandybridge-family CPUs.

It's usually only worth using an extra instruction to avoid an indexed addressing mode if you're going to use that addressing mode in multiple instructions. (e.g. load / modify / store).


Scaling the index is free (on modern CPUs at least) if you're already using a 2-register addressing mode. For lea, Agner Fog's tables list AMD Ryzen as having 2c latency and only 2-per-clock throughput for lea with scaled-index addressing modes (or 3-component), otherwise 1c latency and 0.25c throughput. (e.g. lea rax, [rcx + rdx] is faster than lea rax, [rcx + 2*rdx], but not by enough to be worth using extra instructions instead.) Ryzen also doesn't like a 32-bit destination in 64-bit mode, for some reason. But the worst-case LEA is still not bad at all. And anyway, LEA is mostly unrelated to address-mode choice for loads, because most CPUs (other than in-order Atom) run LEA on the ALUs, not the AGUs used for actual loads/stores.

The main question is between one-register unscaled (so it can be a "base" register in the machine-code encoding: [base + idx*scale + disp]) or two-register. Note that for Intel's micro-fusion limitations, [disp32 + idx*scale] (e.g. indexing a static array) is an indexed addressing mode.


Neither function is totally optimal (even without considering unrolling or vectorization), but clang's looks very close.

The only thing clang could do better is save 2 bytes of code size by avoiding the REX prefixes with add eax, 2 and cmp eax, 200. It promoted all the operands to 64-bit because it's using them with pointers and I guess proved that the C loop doesn't need them to wrap, so in asm it uses 64-bit everywhere. This is pointless; 32-bit operations are always at least as fast as 64, and implicit zero-extension is free. But this only costs 2 bytes of code-size, and costs no performance other than indirect front-end effects from that.

You've constructed your loop so the compiler needs to keep a specific value in registers and can't totally transform the problem into just a pointer-increment + compare against an end pointer (which compilers often do when they don't need the loop variable for anything except array indexing).

You also can't transform it to counting a negative index up towards zero (which compilers never do, but which would reduce the loop overhead to a total of 1 macro-fused add + branch uop on Intel CPUs, which can fuse add + jcc, while AMD can only fuse test or cmp / jcc).

Clang has done a good job noticing that it can use 2*var as the array index (in bytes). This is a good optimization for tune=generic. The indexed store will un-laminate on Intel Sandybridge and Ivybridge, but stay micro-fused on Haswell and later. (And on other CPUs, like Nehalem, Silvermont, Ryzen, Jaguar, or whatever, there's no disadvantage.)

gcc's loop has 1 extra uop in the loop. It can still in theory run at 1 store per clock on Core2 / Nehalem, but it's right up against the 4 uops per clock limit. (And actually, Core2 can't macro-fuse the cmp/jcc in 64-bit mode, so it bottlenecks on the front-end).

  • What do you mean by "un-laminate" as opposed to "stay micro-fused"? – einpoklum Jan 20 '18 at 12:30
  • See the first link in my answer for the full details on micro-fusion and addressing modes: https://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes. (And Agner Fog's microarch guide for more background to understand that Q&A if necessary). – Peter Cordes Jan 20 '18 at 13:15
    I think you'd do well here to put more of a focus on the possible downside of de-lamination, perhaps with a link to your canonical answer on that topic for those that want to dig further. What I get from this is "yes, use indexed addressing modes" - which used to be my strategy too: I considered them more-or-less free if the extra 1 cycle latency didn't matter (so it was often worth it to do all sorts of index transformations in a loop to get everything using indexed and reduce the number of index/counter/pointer increments). – BeeOnRope Jan 20 '18 at 17:15
  • Now that I understand about de-lamination, the opposite is often true: the best strategy is often a small unroll and separate index/counter regs for difference accesses, allowing them to use simple addressing modes. In the fused domain, you pay 1 uop for the increment per access stream, but save N uops (N being the unroll factor) by avoiding de-lamination. Now it doesn't apply to not-unrolled loops or plain `mov`, and yes it makes it more complicated, but since this sounds general it would be good to have a pointer at least to the details. – BeeOnRope Jan 20 '18 at 17:17
  • In principle, it would still be possible to do counting towards zero from positive side. It would be legal for compiler to reorder assignments to start from high indexes and go down, as there is no observable behavior in the loop. However, I see reasons for not doing that. – zch Oct 24 '18 at 23:13
  • @zch: some hardware prefetching works better for ascending instead of descending (like maybe the L1d prefetcher in Sandybridge-family). The L2 streamer can handle either direction, though. Yes in principle that transformation is allowed, but no compilers I'm aware of are that aggressive. And they probably don't want to be, because they're not smart enough to know when or when not to do it. – Peter Cordes Oct 24 '18 at 23:14
  • @Peter - When did Intel add Scaled Indexed Addressing Mode? Can I count on it from Pentium Pro (i686) forward? Does Clang accept it with modern compilers? (I see Harold said something about Netburst, and I believe that was a Pentium/i586 feature). – jww Apr 30 '19 at 16:47
  • @jww: The encoding for 32-bit addressing modes has never changed since they were introduced with 386. [Why don't x86 16-bit addressing modes have a scale factor, while the 32-bit version has it?](//stackoverflow.com/q/55657904) and [Referencing the contents of a memory location. (x86 addressing modes)](//stackoverflow.com/q/34058101). I'm not aware of any clang limitation in asm syntax. – Peter Cordes Apr 30 '19 at 16:51
  • OK, thanks. Let me ask a question about Clang's inability to compile yet another program... – jww Apr 30 '19 at 16:53

Indexed addressing (in loads and stores, lea is different still) has some trade-offs, for example

  • On many µarchs, instructions that use indexed addressing have a slightly longer latency than instructions that don't. But usually throughput is a more important consideration.
  • On Netburst, stores with a SIB byte generate an extra µop, and therefore may cost throughput as well. The SIB byte causes the extra µop whether or not you actually use it for indexed addressing, but indexed addressing always requires a SIB byte, so it always pays that cost. This doesn't apply to loads.
  • On Haswell/Broadwell (and still in Skylake/Kabylake), stores with indexed addressing cannot use port 7 for address generation, instead one of the more general address generation ports will be used, reducing the throughput available for loads.

So for loads it's usually good (or at least not bad) to use indexed addressing if it saves an add somewhere, unless they are part of a chain of dependent loads. For stores, indexed addressing is more dangerous. In the example code it shouldn't make a large difference: saving the add is not really relevant, since ALU instructions wouldn't be the bottleneck, and the address generation happening on ports 2 or 3 doesn't matter either since there are no loads.

  • SKL loads can use port7? I haven't read that anywhere. That would require port7 to handle the load-data part of the uop, not just the address (and writing that address into the memory-order buffer). – Peter Cordes Jan 20 '18 at 11:32
  • @PeterCordes just the address generation perhaps? (though idk how that would work then) Wikichip lists it as just AGU. Agner Fogs table of execution units lists it as "load and store address generation" instead of just "store address generation" – harold Jan 20 '18 at 11:37
  • Huh, I'll check with perf counters if I get around to it. But loads are definitely still single uop, so it wouldn't make sense to have a port that didn't do the whole job. However I could see having a load/store port that only handled scalar integer loads without sign or zero extension, and only base + disp addressing modes. So it wouldn't have to handle broadcast-loads, `vmovddup`, or vector loads in general. – Peter Cordes Jan 20 '18 at 11:40
  • Can you define "SIB byte", "port 7" or link to a definition? – einpoklum Jan 20 '18 at 12:00
  • @einpoklum "SIB" is part of some x86 instruction opcodes, instructions which use "index" register in memory indirection (like `[base_reg + scale*index_reg]`), see ordinary Intel docs about x86 instruction encoding (you can use/not-use it by using or not such x86 instructions). "port 7" is internal architecture of particular CPU, implementing the x86 ISA = area of transistors dedicated to certain tasks. You would have to study in-depth architecture of particular chip model which is using it, and as SW you have no *direct* control which parts of CPU are used to execute which instruction. – Ped7g Jan 20 '18 at 16:45
  • I'm quite sure port7 can't be used for load AGU or load anything. Everything I've seen documented, including from Intel always refers to as "Store AGU" only (and the better sources mention the restriction to "simple" addressing modes), and my own observations haven't turned up anything different (port7 predictably only takes stores with simple addressing modes - although not "greedily"). In any case, [I'll give it a quick test](https://github.com/travisdowns/uarch-bench/issues/28). – BeeOnRope Jan 20 '18 at 17:33
    I wrote [a test](https://github.com/travisdowns/uarch-bench/blob/0d62050ecc35979e5ec9cf83710edbe914489e00/x86_methods.asm#L282) which has two loads (simple addressing) and a store (complex). Running it as `./uarch-bench.sh --timer=libpfc --test-name=misc/port7 --extra-events=UOPS_DISPATCHED.PORT_2,UOPS_DISPATCHED.PORT_3,UOPS_DISPATCHED.PORT_4,UOPS_DISPATCHED.PORT_7` I get exactly 1.5 cycles per iteration (i.e., bottlenecked on the 2 port2/3 AGUs for the 3 operations) and 1.5, 1.5, 1.0, 0.0 uops to ports 2,3,4 and 7 respectively, as you'd expect if port7 can't take any of the load. – BeeOnRope Jan 20 '18 at 17:54
  • @harold - yeah I think it's just a typo in Agner's doc, or when he wrote it he wasn't aware of the limitation. I'm pretty sure it now is since there have been various discussions on his forums where the store-only nature of port 7 is explicit. – BeeOnRope Jan 20 '18 at 17:57