
I recall that read-modify-write (RMW) instructions are generally to be avoided when optimizing x86 for speed. That is, you should avoid something like add [rsi], 10, which adds to the memory location whose address is in rsi. The usual recommendation was to split it into a read-modify instruction followed by a store, something like:

mov rax, 10
add rax, [rsp]
mov [rsp], rax

Alternatively, you might use explicit load and store instructions and do the add in a register:

mov rax, [rsp]
add rax, 10
mov [rsp], rax

Is this still reasonable advice (and was it ever?) for modern x86?¹

Of course, in cases where a value from memory is used more than once, RMW is inappropriate, since you will incur redundant loads and stores. I'm interested in the case where a value is only used once.
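
For instance, here is a rough sketch of the multiple-use case (the comparison against 100 is just an invented second use of the value):

add qword [rdi], 10
cmp qword [rdi], 100   ; second use reloads the same location

versus keeping the value live in a register, which needs only one load and one store:

mov rax, [rdi]
add rax, 10
mov [rdi], rax
cmp rax, 100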

Based on exploration in Godbolt, all of icc, clang and gcc prefer to use a single RMW instruction to compile something like:

void Foo::f() {
  x += 10;
}

into:

Foo::f():
    add     QWORD PTR [rdi], 10
    ret

So at least most compilers seem to think RMW is fine when the value is only used once.

Interestingly enough, the various compilers do not agree when the incremented value is a global, rather than a member, such as:

int global;

void g() {
  global += 10;
}

In this case, gcc and clang still emit a single RMW instruction, while icc prefers to do the add in a register with an explicit load and store:

g():
        mov       eax, DWORD PTR global[rip]                    #5.3
        add       eax, 10                                       #5.3
        mov       DWORD PTR global[rip], eax                    #5.3
        ret     
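
For comparison, gcc and clang compile g() to a single memory-destination add on the global (reconstructed from memory of the Godbolt output, so the exact operand syntax may differ slightly):

g():
        add     DWORD PTR global[rip], 10
        ret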

Perhaps it has something to do with RIP-relative addressing and micro-fusion limitations? However, icc13 still does the same thing with -m32, so perhaps it's more to do with the addressing mode requiring a 32-bit displacement.


¹ I'm using the deliberately vague term modern x86 to mean roughly the last few generations of Intel and AMD laptop/desktop/server chips.

BeeOnRope
  • Did you check Agner Fog's microarch guide? RIP-relative addressing modes might be special somehow, but I think they can micro-fuse sometimes. But generally, IIRC, memory-destination ops are at most the same number of uops as separate load/modify/store instructions. IIRC it wasn't true for `adc` on Broadwell/Skylake. – Peter Cordes Jun 26 '16 at 01:44
  • icc might favor separate instructions so they can all be single-uop instructions. godbolt's icc13 is old; licensing issues make it unclear whether it's legal to provide a newer version. It might be optimizing for decode speed on CPUs with no uop cache, where a multi-uop instruction that isn't the first in a decode group will reduce decoder throughput. (OTOH, a multi-uop instruction that *is* the first in a decode group increases decode throughput, since Core2/Nehalem decoders can decode up to 4-1-1-1, unlike SnB-family). – Peter Cordes Jun 26 '16 at 01:46
  • I might turn this into an answer later, but for now this is just a quick comment without looking anything up and just skimming the question. – Peter Cordes Jun 26 '16 at 01:47
  • Yeah the icc output is from godbolt's 13.01 version of icc. – BeeOnRope Jun 26 '16 at 01:47
  • icc might be trying to avoid disp32 + immediate in the same instruction (See my edit to the question; it does the same thing with -m32). disp + imm has an impact on micro-fusion, IIRC. Or at least can require multiple entries in the uop cache, even if it's actually a single fused-domain uop as far as the ROB is concerned. Although I thought that was only a problem with disp32+imm32, not just disp32 + imm8. Anyway, I might dig up this stuff in Agner's manuals, but feel free to dig it up yourself based on these comments as a starting point, and write it up as a nice answer. – Peter Cordes Jun 26 '16 at 01:55
  • Darek Mihocka tested this a while back with various architectures; you can read him noodle about it on his blog [here](http://www.emulators.com/docs/nx06_rmw.htm) (part of a fascinating series of articles on problems with the Pentium 4 and x86 in general). Different compilers have different strategies for code generation, but the reason is that performance tuning is complicated. There's no one-size-fits-all strategy. You have to benchmark on your target architecture if you really care. – Cody Gray - on strike Jun 26 '16 at 09:58
  • Here's what Agner (page 107) says about RIP-relative addressing (for Nehalem - there doesn't seem to be an update since then): _Instructions that have both a rip-relative address and immediate data cannot use µop fusion. For example, CMP BYTE PTR [RIP+m],AL can fuse, but CMP BYTE PTR [RIP+m],1 cannot._ Does `immediate data` here include offsets in addressing modes, like `global` in this case? I'm guessing not. So indeed the global variant wouldn't microfuse on 64-bits (doesn't explain 32-bits though). – BeeOnRope Jun 26 '16 at 22:12
  • As you point out, some combinations of RIP-relative addressing and immediate also take more space in the uop cache. – BeeOnRope Jun 26 '16 at 22:13
  • @BeeOnRope: didn't notice your comment originally. I think I was mixed up between micro-fusion and using 2 entries in a line(way) of the uop cache. (space can be borrowed from another uop in the same way, if other uops have spare space). Agner's table 9.1 indicates that an instruction with a disp32 in its addressing mode and any immediate data takes 2 entries in a way (unless the disp32 and immediate each fit in a sign-extended 16-bit integer). Such an instruction can still micro-fuse, though. – Peter Cordes Oct 27 '16 at 08:51

1 Answer


Are RMW instructions considered harmful on modern x86?

No.

On modern x86/x64, instructions are decoded into uops.
Any RMW instruction will be broken down into a number of uops; in the unfused domain these are essentially the same uops that the equivalent separate load, modify and store instructions would produce (and thanks to micro-fusion the front-end can even see fewer uops for the RMW form).

By using a 'complex' RMW instruction instead of separate 'simple' read, modify and write instructions, you gain the following (a rough size sketch follows the list).

  1. Fewer instructions to decode.
  2. Better utilization of the instruction cache.
  3. Better utilization of the addressable registers.
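
For instance, a minimal sketch of points 1-3 (byte counts assume the usual REX.W + imm8 encodings; the choice of rdi/rax is arbitrary):

; memory-destination form: one instruction, 4 bytes, no register consumed
add qword [rdi], 10

; split form: three instructions, 10 bytes, and rax is tied up
mov rax, [rdi]
add rax, 10
mov [rdi], rax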

You can see the latency equivalence quite clearly in Agner Fog's instruction tables.

ADD [mem],const has a latency of 5 cycles.

MOV [mem],reg (and vice versa) each have a latency of 2 cycles, and ADD reg,const has a latency of 1 cycle, for a total of 5.

I checked the timings for Intel Skylake, but AMD K10 is the same.

You need to take into account that compilers have to cater to many different processors, and some compilers even use the same core logic for different processor families. This can lead to quite suboptimal strategies for any particular chip.

RIP-relative addressing
On x64, RIP-relative addressing takes an extra cycle to resolve RIP on older processors.
Skylake does not have this delay, and I'm sure later cores will eliminate it as well.
I'm sure you're aware that 32-bit x86 does not support EIP-relative addressing; there you have to obtain the instruction pointer in a roundabout fashion.
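
A minimal sketch of the classic 32-bit workaround (the exact relocation syntax for global depends on the toolchain; GCC wraps this pattern in its __x86.get_pc_thunk.* helpers):

        call  .next
.next:  pop   eax                   ; eax = runtime address of .next
        add   eax, global - .next   ; eax = runtime address of global
        add   dword [eax], 10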

Johan
  • Agner's store/load latencies are more or less made up to split the total store-forwarding latency between the load and store. He unfortunately doesn't always list the load latency as the actual load-use latency that you'd see in a pointer-chasing loop. See [this answer](http://stackoverflow.com/questions/10274355/cycles-cost-for-l1-cache-hit-vs-register-on-x86/39133786#39133786) for the difference between store-forwarding and L1 load-use latency. – Peter Cordes Aug 25 '16 at 23:42
  • According to Agner's table, Haswell decodes `add m, r/i` to 2 fused-domain uops (if the addressing mode can micro-fuse). That's less than separate load/add/store. It is still 4 unfused-domain uops (including the separate store-data and store-address uops). The column for which ports it uses shows `2p0156 2p237 p4`, which I think is wrong. IDK why it would use two ALU uops (p0156), and that's incompatible with it also using 2 load/AGU ports and the store-data port and totalling 4 unfused-domain uops. Anyway, I think if anything, RMW instructions are win on modern OOO CPUs. – Peter Cordes Aug 25 '16 at 23:48
  • @PeterCordes I timed 1e7 iterations of 20 `add m, r/i` + `dec` + `jnz` (a sketch of the loop is at the end of these comments). I got [this](http://pastebin.com/0i4q6E04). 2.2e8 instructions issue, as well as 4.1e8 fused uops. 2e8 loads, store-data and store-addresses happen. Port 4 handles 100% of store-datas, Port 7 handles 55% of the store-addresses, Ports 2 & 3 each handle 50% of the loads and 22.5% of the store-addresses. Port 6 handles 100% of dec+taken branches (1e7). What's left is 2.4e8 uops split across Ports 0,1,5,6, which aren't fully explained by 1 ALU uop for all 2e8 `add`. – Iwillnotexist Idonotexist Oct 27 '16 at 05:41
  • @IwillnotexistIdonotexist: Interesting. Is that Haswell or Skylake? Timer interrupts or whatever can't be frequent enough to add a huge number of extra uops, can they? Maybe try with a different register, to rule out stack-sync uops? Maybe use r8 as your counter, since your 64-bit DEC needs a REX prefix anyway, leaving RSI free. That will make all the ADD instructions one byte shorter, since RSP as a base register requires a SIB byte even with no index :( I highly doubt code-size has anything to do with this, but RSP vs. non-RSP is possible. So prob. just try `mov rsi, rsp`. – Peter Cordes Oct 27 '16 at 06:08
  • @PeterCordes Nah, that's with my `libpfc`, which intentionally disables counting while in kernel mode. Changing my register alloc to elsewhere than `rsp` made no difference. I think I actually have a surprisingly simple explanation: The core alternates running flat-out then stalling for 1 cycle. Observe that on 1e8 cycles 0 uops were issued, on 1e7 cycles 1 uop was issued, and on all 1e8 other cycles 4 uops were issued. As it happens there are also 1e8 `resource_stalls.rs` events. It's like the dec+branch is isolated out. – Iwillnotexist Idonotexist Oct 27 '16 at 06:25
  • @IwillnotexistIdonotexist: IIRC, it's normal that no uops issue if a group of 4 is ready but the RS or ROB is full. I think I've seen that before. So yes, the front-end stalls while the OOO core is full, rather than feeding in a trickle. The counts look great for issue (41 fused-domain uops per iteration, so 2 per add) and for retirement (81 unfused-domain uops per iter, so 4 per add.) – Peter Cordes Oct 27 '16 at 07:28
  • @IwillnotexistIdonotexist: But that doesn't match with the per-port counts (`(58637482+60167788+61578783+71445964)/ 10000000` = ~25.18 ALU uops per iter?). That's what you were saying about 2.4e8 uops across the ALU ports, right? That's the only thing that's now unexplainable. (I think we've confirmed, that Agner's table is wrong, and it's 1 ALU uop). Your per-port uops were counted in different runs. Maybe the CPU settles into a pattern, and happened to run more uops on the ports you were measuring each run? Try recording p0,p1,p5,and p6 in a single run. Also, HSW or SKL? – Peter Cordes Oct 27 '16 at 07:31
  • @PeterCordes HSW as always. Yes, I was referring to the mystery extra 25% workload from somewhere. I did another run, recording p0156 and p2347 simultaneously: Same sort of thing. p0156 gave `58578120+60002738+62237010+72324777=253142645`. – Iwillnotexist Idonotexist Oct 27 '16 at 07:49
  • @IwillnotexistIdonotexist: At least it seems reproducible. I'm having a hard time coming up with any plausible guess at an explanation. Maybe the OOO core can re-execute uops or generate new uops internally? (without retiring them?) What if we take store-forwarding out of the picture entirely, by looping over a *giant* array. The cycle count will obviously be dominated by cache / TLB misses and COW page-faults, but I just want the ALU uop count. (Probably single-byte adds will be the same as dwords, reducing memory footprint for the no-reuse case). – Peter Cordes Oct 27 '16 at 08:38
  • @PeterCordes Just tried doing 20 `add [byte], r` + `sub` + `jnz` on a buffer of 200MB. Roughly the same sort of readings: 2.2e8 insns, 4.1e8 uops issued, p0156: `68209206+70595607+72926629+83506751`, p2347: `143388559+148493900+200008711+108157372`, 8.1e8 retired uops, 1.1e8 `resource_stalls.rs`. On about `1.4e8 / 2.5e8` cycles no uops were issued; On `> 1.0e8` other cycles 4 uops issued. – Iwillnotexist Idonotexist Oct 27 '16 at 09:02
  • @IwillnotexistIdonotexist: So we're seeing a different ratio of ADD insns to ALU uops. 29.5 p0156 uops per iter - 1macro-fused branch = 28.5 uops / 20 ADDs = 1.476 ALU uops per ADD. (vs. 1.259 for the tighter loop). What if you try a loop with mostly `add r,r` (or `mov r,imm` to avoid any loop-carried dependency?), and only a couple `add m,r`? If the bottleneck is ALU uops, not the store-data port, will we still see an extra .26 uops per memory-dest ADD? Or maybe other bottlenecks, like front-end if you can bust the uop cache / loop buffer somehow? Or a shorter loop-carried mem dep? – Peter Cordes Oct 27 '16 at 09:56
  • @IwillnotexistIdonotexist: if you have time, can you also test uop throughput from the LSD for small loops that aren't a multiple of 4 uops on your Haswell CPU? BeeonRope has test code for this [in the answer to this question](http://stackoverflow.com/questions/39311872/is-performance-reduced-when-executing-loops-whose-uop-count-is-not-a-multiple-of). The only data points we have are Skylake (unrolling inside the loop buffer or something) and incomplete for SnB (the first uop of one iter can't issue in the same group as the last uop of the previous). – Peter Cordes Oct 28 '16 at 01:56
  • @PeterCordes Yeah, I can try. I'm a bit busy with other things, but I'll get to it. – Iwillnotexist Idonotexist Oct 28 '16 at 02:13
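
For reference, a sketch of the kind of test loop discussed in the comments above (the iteration count and instruction mix come from the comments; the choice of rdi as the pointer and rsi as the counter is an assumption, and the original measurements wrapped this in libpfc):

        mov   rsi, 10000000          ; 1e7 iterations
.loop:
        add   qword [rdi], 10        ; repeated 20 times per iteration
        add   qword [rdi], 10
        add   qword [rdi], 10
        ; ...17 more copies...
        dec   rsi
        jnz   .loop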