
Can I tell clang to "use xxx instructions" or "don't use yyy instructions" for built-in or C11 atomic operations (assuming there is an alternative)?

[edit] the microbenchmark that can reproduce the performance regression: a naive rwlock.

It all began with a recent upgrade to clang 9.0. There were some performance regressions in my program, so I inspected the generated instructions. Surprisingly, clang has changed a lot about which instructions it uses for the atomic functions (C11's atomic_fetch_{add,sub} on x86_64).

With clang 8, it all compiles to lock addl and lock subl. But clang 9 switched to lock incl and lock decl for +1 and -1, and to lock xadd for other values (at least for addition; I didn't check whether there is an xsub or whether subl is still used). The inc/xadd/add instructions usually have the same speed, but in certain scenarios I observed better efficiency with lock addl. Anyway, I'm not going to argue over whether one instruction can be faster than another. Since clang 8 used one, there should be a way to keep using it.

I tried using inline assembly to force lock addl, but it's hard to fully exploit the efficiency because inline assembly prevents some optimizations. The ideal solution would be to tell the compiler to do it the old way.
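For reference, a minimal sketch of the kind of inline-asm workaround I mean (the function name and constraints below are illustrative, not my actual code):

// Illustrative sketch only: force lock addl via GNU extended asm.
// The "memory" clobber acts as a compiler barrier, which is part of why
// the surrounding code can't be optimized as well as with a real C11 atomic.
static inline void add_with_lock_addl(int *p, int v)
{
    __asm__ __volatile__("lock addl %1, %0"
                         : "+m"(*p)
                         : "ir"(v)
                         : "memory");
}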

I'm not posting my real code here because the question is not about improving performance, but I've shared the microbenchmark (linked above) for anyone willing to verify the performance issue.

[update] I tested it on both Broadwell and Skylake. The performance regression only shows up with certain workloads. I'm not sure whether it's actually caused by the instruction selection or by some other difference in how clang 8 and 9 optimize.

wuxb

1 Answer


You can report the performance regression on clang's bug tracker, especially if you can produce a minimal reproducible example (MCVE) that's slower with clang 9 for the same reason as your main code.

You can't generally force the compiler's instruction-selection choices without fully doing it yourself with inline asm.

But tuning options can be relevant, e.g. -march=native to tune for your hardware. (-march=haswell, or -march=native on a Haswell, sets -mtune=haswell as well as enabling all the ISA extensions the CPU has.)

e.g. gcc or clang will sometimes avoid inc in general, but use it when compiling for a specific CPU that doesn't have a problem with it. (Not talking about the lock version, just inc reg).

It looks like that's the case here: using -march=skylake gets clang 9.0 to use lock inc instead of lock add [mem], 1, but clang 8.0 didn't on Godbolt. (With just -O3, clang 9.0 still uses lock add/sub.)

inc reg is good on modern x86 (except for Silvermont / KNL), but inc mem (without lock) costs an extra uop on Intel CPUs: no micro-fusion of the load+add uops, only the store part (https://agner.org/optimize/ and https://uops.info/). IDK if it's also worse with lock inc vs. lock add or if you're seeing some other effect. (Related: INC instruction vs ADD 1: Does it matter?)

According to https://uops.info/, lock add m32, imm8 has identical uop count (8), latency and throughput to lock inc m32 on Skylake. On Haswell there's one fewer back-end uop, like the difference without a lock prefix. But it's unlikely that affects throughput. I didn't check other uarches and you didn't say what CPU you had.

I wouldn't really recommend -march=skylake -mtune=generic: it might fix this code-gen issue but could lead to worse tuning decisions in the rest of your code. Except it doesn't even work here; I guess clang handles the arch and tune options differently from GCC. You could avoid -march entirely, leave -mtune at the default, and just enable -mavx2 -mfma -mpopcnt -mbmi -mbmi2 -maes -mcx16 and any other relevant ISA extensions your CPU has.
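If you go the per-extension route, one quick sanity check (just a sketch; these are the standard macros clang/gcc predefine for the corresponding -m options) is to confirm the extensions you care about actually got enabled:

/* Sketch: each macro is predefined when its -m option (or an -march that
 * implies it) is in effect, so this verifies the flag set you picked. */
#include <stdio.h>

int main(void) {
#ifdef __AVX2__
    puts("AVX2 enabled");
#endif
#ifdef __FMA__
    puts("FMA enabled");
#endif
#ifdef __BMI2__
    puts("BMI2 enabled");
#endif
#ifdef __POPCNT__
    puts("POPCNT enabled");
#endif
    return 0;
}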


"and lock xadd for other numbers (at least for addition..." (from the question)

Are you sure you still enabled optimization for the new compiler?

When the result of --*p or atomic_fetch_add(p, -2) is unused, clang 9.0 still uses lock dec or lock sub. I can only get clang to use lock xadd if I disable optimization, making the surrounding code into total garbage.

Or with optimization enabled, if the result is returned. IDK; maybe in more complex functions clang 9.0 changed something so it no longer finds the same optimizations in your code, and uses lock xadd to get the old value into a register, e.g. to return it to a caller that ignores it, if it decided not to inline as aggressively.

lock xadd is definitely slower than lock add or lock sub, but clang doesn't use it unless it has to (or if you disable optimization).
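To make the used-vs-unused distinction concrete, here's a sketch (function names are just for illustration) of the two cases, which you can drop into Godbolt and check against your exact clang version and flags:

#include <stdatomic.h>

// Old value needed: clang has no choice but lock xadd (clang 8 and 9 alike).
int  add2_used(_Atomic int *p)   { return atomic_fetch_add(p, 2); }

// Result discarded: expect plain lock addl $2, analogous to atomic_dec2 below.
void add2_unused(_Atomic int *p) { atomic_fetch_add(p, 2); }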

asm output for clang8.0 -O3 -march=skylake vs. clang9.0 -O3 -march=skylake (not including a ret) (Godbolt)

#include <stdatomic.h>

void incmem(int *p) { ++*p; }
    clang8:     addl   $1, (%rdi)       clang9:  incl    (%rdi)

void atomic_inc(_Atomic int *p) { ++*p; }
    clang8: lock addl  $1, (%rdi)       clang9: lock incl (%rdi)

void atomic_dec(_Atomic int *p) { --*p; }
    clang8: lock subl  $1, (%rdi)       clang9: lock decl (%rdi)

void atomic_dec2(_Atomic int *p) {
    atomic_fetch_add(p, -2);
}
    clang8: lock addl  $-2, (%rdi)      clang9: lock addl $-2, (%rdi)


// returns the result
int fetch_dec(_Atomic int *p) { return --*p; }
    clang8:                                    clang9:
        movl  $-1, %eax                            movl  $-1, %eax
        lock xaddl   %eax, (%rdi)                  lock  xaddl %eax, (%rdi)
        addl  $-1, %eax                            decl    %eax
        retq

With optimization disabled, we get clang 8 and 9 making literally identical code with -O0 -march=skylake:

# both clang8 and 9 with  -O0 -march=skylake
atomic_dec2:
        pushq   %rbp
        movq    %rsp, %rbp
        movq    %rdi, -8(%rbp)
        movq    -8(%rbp), %rax
        movl    $-2, -12(%rbp)
        movl    -12(%rbp), %ecx
        lock    xaddl   %ecx, (%rax)      # even though result is unused
        movl    %ecx, -16(%rbp)
        popq    %rbp
        retq
Peter Cordes
  • Appreciate it! I'll soon report it to the clang/llvm mailing list. – wuxb Oct 19 '19 at 00:29
  • @Nybble: submit it on https://bugs.llvm.org as a missed-optimization aka performance bug, if you have a MCVE for it. – Peter Cordes Oct 19 '19 at 00:47
  • +1 becomes lock inc, and +n becomes lock xadd. I don't think it's about optimization. -O0 should never *intentionally* produce slower code. (I was using -O3 -march=native -flto) – wuxb Oct 19 '19 at 00:57
  • @Nybble: Like I said in my answer, I can't reproduce `fetch_add` compiling to `lock xadd` on clang9 or trunk on Godbolt, except when the result is used so it has no choice. What clang version are you using exactly? – Peter Cordes Oct 19 '19 at 01:00
  • @Nybble: re: `-O0` should never intentionally produce slower code: well that's not exactly true, it intentionally compiles quickly, spending little CPU time optimizing the resulting asm. My `fetch_add` with `-O0` example shows we get a `lock xadd` because optimization doesn't notice the result is unused, and/or doesn't peephole-optimize it down to `lock inc` or `lock add`. Missing peephole optimizations is exactly the kind of thing we should expect from `-O0`. (Along with massive slowdowns from not doing register allocation, instead reloading+spilling around every statement for debugging.) – Peter Cordes Oct 19 '19 at 01:05
  • I have updated the question with a link to the source code and results obtained from my machines. I have not acquired an account on llvm bug reports so that will take some time. – wuxb Oct 19 '19 at 06:03
  • @Nybble: oh, your `lock xadd` finally makes sense now. Would have saved a lot of time if you'd provided a MCVE. You *are* using the result, but when adding only 1 the compiler can use the flag result of `inc` or `add`. https://godbolt.org/z/uVEiTv. Getting `xadd` when you use `+2` or `+3` happens on clang 8 as well, that part is a total red herring. (It does rule it out as a useful workaround). Anyway, very likely your performance regression is unrelated to lock inc vs. lock add, unless something is sensitive to code alignment (affected by code size). e.g. branch prediction aliasing. – Peter Cordes Oct 19 '19 at 06:15