I recall that read-modify-write (RMW) instructions are generally to be avoided when optimizing x86 for speed. That is, you should avoid something like add [rsi], 10, which adds 10 to the memory location whose address is stored in rsi. The recommendation was usually to split it into a read-modify (load-op) instruction followed by a store, so something like:
mov rax, 10
add rax, [rsp]
mov [rsp], rax
Alternately, you might use explicit load and stores and a reg-reg add operation:
mov rax, [rsp]
add rax, 10
mov [rsp], rax
Is this still reasonable advice (and was it ever?) for modern x86? [1]
Of course, in cases where a value from memory is used more than once, RMW is inappropriate, since you will incur redundant loads and stores. I'm interested in the case where a value is only used once.
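To make the distinction concrete, here is a minimal C++ sketch of the two situations. The function names are hypothetical and chosen only for illustration; what a given compiler actually emits for them will vary with version and flags.

```cpp
#include <cstdint>

// Hypothetical example: the updated value is never read again here,
// so a single memory-destination add is a candidate.
void bump_once(std::int64_t* p) {
    *p += 10;
}

// Hypothetical example: the updated value is needed again right away,
// so keeping it in a register (load, add, store) avoids a second load.
std::int64_t bump_and_double(std::int64_t* p) {
    std::int64_t v = *p + 10;  // load and add in a register
    *p = v;                    // store the result back
    return v * 2;              // reuse the register copy
}
```

The question is only about the first shape, where the result goes straight back to memory and is not consumed again.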
Based on exploration in Godbolt, all of icc, clang and gcc prefer to use a single RMW instruction to compile something like:
void Foo::f() {
x += 10;
}
into:
Foo::f():
add QWORD PTR [rdi], 10
ret
So at least most compilers seem to think RMW is fine, when the value is only used once.
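For anyone who wants to reproduce this, a self-contained version of the snippet might look like the sketch below. The class definition is my assumption, since only the member function is shown above: a 64-bit x at offset 0 is what add QWORD PTR [rdi], 10 implies.

```cpp
#include <cstdint>

// Assumed definition: a single 64-bit member at offset 0, consistent with
// the `add QWORD PTR [rdi], 10` output (rdi holds `this`).
struct Foo {
    std::int64_t x;
    void f();
};

void Foo::f() {
    x += 10;  // updated in place; the new value is not used again here
}
```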
Interestingly enough, the various compilers do not agree when the incremented value is a global, rather than a member, such as:
int global;
void g() {
global += 10;
}
In this case, gcc and clang still use a single RMW instruction, while icc prefers a reg-reg add with explicit loads and stores:
g():
mov eax, DWORD PTR global[rip] #5.3
add eax, 10 #5.3
mov DWORD PTR global[rip], eax #5.3
ret
Perhaps it has something to do with RIP-relative addressing and micro-fusion limitations? However, icc13 still does the same thing with -m32, so perhaps it's more to do with the addressing mode requiring a 32-bit displacement.
[1] I'm using the deliberately vague term modern x86 to basically mean the last few generations of Intel and AMD laptop/desktop/server chips.