You can report the performance regression on clang's bug tracker, especially if you can produce an MCVE (minimal reproducible example) that's slower with clang9 for the same reason as your main code.
You can't generally force the compiler's instruction-selection choices without fully doing it yourself with inline asm.
But tuning options can be relevant, e.g. -march=native to tune for your hardware. (-march=haswell, or -march=native on a Haswell, for example, sets -mtune=haswell as well as enabling all the ISA extensions you have.)
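For example (just a sketch of how those options behave; foo.c is a placeholder file name):

# tunes for Haswell and enables its ISA extensions (AVX2, FMA, BMI2, ...)
clang -O3 -march=haswell -c foo.c

# same idea, but for whatever CPU you're compiling on
clang -O3 -march=native -c foo.c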
e.g. gcc or clang will sometimes avoid inc in general, but use it when compiling for a specific CPU that doesn't have a problem with it. (Not talking about the lock version, just inc reg.)
It looks like that's the case here: using -march=skylake gets clang9.0 to use lock inc instead of lock add [mem], 1, but it didn't with clang8.0 on Godbolt. (With just -O3, clang9.0 still uses lock add/sub.)
inc reg is good on modern x86 (except for Silvermont / KNL), but inc mem (without lock) costs an extra uop on Intel CPUs: no micro-fusion of the load+add uops, only the store part (https://agner.org/optimize/ and https://uops.info/). IDK if it's also worse with lock inc vs. lock add, or if you're seeing some other effect. See also: INC instruction vs ADD 1: Does it matter?
According to https://uops.info/, lock add m32, imm8 has identical uop count (8), latency, and throughput to lock inc m32 on Skylake. On Haswell there's one fewer back-end uop, like the difference without a lock prefix, but it's unlikely that affects throughput. I didn't check other uarches, and you didn't say what CPU you have.
I wouldn't really recommend -march=skylake -mtune=generic: it might fix this code-gen issue, but it might lead to worse tuning decisions in the rest of your code. Except it doesn't even work; clang apparently handles arch and tune options differently from GCC. You could avoid -march options entirely and leave -mtune at the default, and just enable -mavx2 -mfma -mpopcnt -mbmi -mbmi2 -maes -mcx16 and any other relevant ISA extensions your CPU has.
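For example, something along these lines (a sketch; adjust the extension list to what your CPU actually supports, and foo.c is again a placeholder):

# no -march / -mtune: tuning stays at the default, but the compiler can
# still use these instruction-set extensions
clang -O3 -mavx2 -mfma -mpopcnt -mbmi -mbmi2 -maes -mcx16 -c foo.c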
Regarding "and lock xadd for other numbers (at least for addition...": are you sure you still enabled optimization for the new compiler?
When the result of --*p or atomic_fetch_add(p, -2) is unused, clang 9.0 still uses lock dec or lock sub. I can only get clang to use lock xadd if I disable optimization (making the surrounding code into total garbage), or, with optimization enabled, by returning the result. IDK, maybe in more complex functions clang 9.0 changed something that means it's not finding the same optimizations in your code, and is using lock xadd to get the old value into a register, e.g. to return it to a caller that ignores it, if it decided not to inline as aggressively.
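Here's a sketch of that guess (hypothetical function names, not your code): if the decrement ends up in a function that doesn't get inlined and returns the result, the callee has to produce the old value with lock xadd even though this particular caller throws it away.

#include <stdatomic.h>

// hypothetical stand-in for a helper clang decided not to inline
__attribute__((noinline))
static int fetch_dec_noinline(_Atomic int *p) {
    return --*p;            // result is needed here: lock xadd + dec
}

void discard_result(_Atomic int *p) {
    // this caller ignores the return value, but the out-of-line callee
    // already committed to lock xadd to produce it
    (void)fetch_dec_noinline(p);
}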
lock xadd is definitely slower than lock add or lock sub, but clang doesn't use it unless it has to (or if you disable optimization).
Asm output for clang8.0 -O3 -march=skylake vs. clang9.0 -O3 -march=skylake (not including a ret) (Godbolt):
#include <stdatomic.h>
void incmem(int *p) { ++*p; }
    // clang8:  addl   $1, (%rdi)
    // clang9:  incl   (%rdi)
void atomic_inc(_Atomic int *p) { ++*p; }
    // clang8:  lock addl  $1, (%rdi)
    // clang9:  lock incl  (%rdi)
void atomic_dec(_Atomic int *p) { --*p; }
    // clang8:  lock subl  $1, (%rdi)
    // clang9:  lock decl  (%rdi)
void atomic_dec2(_Atomic int *p) {
    atomic_fetch_add(p, -2);
}
    // clang8:  lock addl  $-2, (%rdi)
    // clang9:  lock addl  $-2, (%rdi)
// returns the result
int fetch_dec(_Atomic int *p) { return --*p; }
    // clang8:                        clang9:
    //   movl  $-1, %eax                movl  $-1, %eax
    //   lock  xaddl %eax, (%rdi)       lock  xaddl %eax, (%rdi)
    //   addl  $-1, %eax                decl  %eax
    //   retq
With optimization disabled, we get clang 8 and 9 making literally identical code with -O0 -march=skylake:
# both clang8 and 9 with -O0 -march=skylake
atomic_dec2:
    pushq   %rbp
    movq    %rsp, %rbp
    movq    %rdi, -8(%rbp)
    movq    -8(%rbp), %rax
    movl    $-2, -12(%rbp)
    movl    -12(%rbp), %ecx
    lock    xaddl %ecx, (%rax)     # even though result is unused
    movl    %ecx, -16(%rbp)
    popq    %rbp
    retq