
Here we have three similar functions. They all always return 0.

This first function is optimized well. The compiler sees that x is always 1.

int f1()
{
    std::atomic<int> x = 1;
    if (x.load() == 1)
    {
        return 0;
    }
    return 1;
}
f1():                                 # @f1()
        xor     eax, eax
        ret

This second function is more interesting. The compiler realized that subtracting 0 from x is a no-op, so the subtraction itself was optimized away. However, it does not notice that x is always 1 and still performs a comparison. Also, despite the use of std::memory_order_relaxed, an mfence instruction was emitted.

int f2()
{
    std::atomic<int> x = 1;
    if (x.fetch_sub(0, std::memory_order_relaxed) == 1)
    {
        return 0;
    }
    return 1;
}
f2():                                 # @f2()
        mov     dword ptr [rsp - 4], 1
        mfence
        mov     ecx, dword ptr [rsp - 4]
        xor     eax, eax
        cmp     ecx, 1
        setne   al
        ret
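As an aside, if the read-modify-write itself is not needed, replacing the no-op fetch_sub(0) with a plain relaxed load expresses the same read without RMW semantics, and clang should then be able to fold the function the same way as f1. (This variant is my addition for comparison, not part of the original snippets.)

```cpp
#include <atomic>

// Hypothetical variant of f2: a plain relaxed load instead of the
// no-op fetch_sub(0). A load carries no RMW semantics, so after
// escape analysis the compiler is free to fold it to the constant 1.
int f2_load()
{
    std::atomic<int> x{1};
    if (x.load(std::memory_order_relaxed) == 1)
    {
        return 0;
    }
    return 1;
}
```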

Finally, this is the real example of what I'm trying to optimize. What I'm really doing is implementing a simple shared_pointer, and x represents the reference count. I'd like the compiler to optimize away the needless atomic operation in the case where a temporary shared_pointer object is created and immediately destroyed.

int f3()
{
    std::atomic<int> x = 1;
    if (x.fetch_sub(1, std::memory_order_relaxed) == 1)
    {
        return 0;
    }
    return 1;
}
f3():                                 # @f3()
        mov     dword ptr [rsp - 4], 1
        xor     eax, eax
        lock            dec     dword ptr [rsp - 4]
        setne   al
        ret
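For context, the release path of such a shared_pointer maps directly onto f3. A minimal sketch with hypothetical names (not my actual implementation, which lives inside a coroutine promise):

```cpp
#include <atomic>

// Sketch: the owner whose decrement takes the count from 1 to 0 is
// responsible for destroying the managed object.
struct ref_counted
{
    std::atomic<int> refs{1};
};

// Returns true if the caller just dropped the last reference.
bool release(ref_counted& rc)
{
    // Same pattern as f3: relaxed decrement, compare old value to 1.
    return rc.refs.fetch_sub(1, std::memory_order_relaxed) == 1;
}
```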

I am using clang. How can I make it optimize f3 like it did with f1?
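One workaround I could imagine (a sketch only; I have not verified how clang compiles it, and it relies on the invariant that the caller itself holds a reference, so observing a count of 1 means exclusive ownership) is to try a plain load before falling back to the RMW:

```cpp
#include <atomic>

// Hypothetical variant of f3: check with a plain load first. If this
// thread holds a reference and observes a count of 1, it holds the
// only reference, so the atomic RMW can be skipped entirely. Since
// the load is not an RMW, the compiler has more freedom to fold it.
// (A real shared-pointer release would typically want acquire/release
// ordering here rather than relaxed.)
int f3_load_first()
{
    std::atomic<int> x{1};
    if (x.load(std::memory_order_relaxed) == 1)
    {
        return 0; // sole owner: no atomic decrement needed
    }
    if (x.fetch_sub(1, std::memory_order_relaxed) == 1)
    {
        return 0; // this call removed the last reference
    }
    return 1;
}
```

Whether clang actually folds this down to `xor eax, eax; ret` like f1 would have to be checked; the point is only that the common path no longer contains an RMW.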

janekb04
  • The use of stack based (local to a function) atomics is probably quite rare. What is your real use-case? – Richard Critten Apr 27 '23 at 12:19
  • Basically a duplicate of [Why don't compilers merge redundant std::atomic writes?](https://stackoverflow.com/q/45960387) which discusses the fact that compilers don't optimize atomics in general. I was surprised clang managed to optimize away even `f1`, I guess due to escape analysis and lack of any atomic RMWs. Also [Why do relaxed atomic operations prevent compiler optimizations?](https://stackoverflow.com/q/70578990) – Peter Cordes Apr 27 '23 at 12:21
  • @RichardCritten This is my real use-case. The `atomic` I am showing here is a member of my `shared_pointer`. However, here I show it inlined. This shared pointer is actually a shared coroutine that I use in my [job system](https://github.com/janekb04/job_system/tree/main). – janekb04 Apr 27 '23 at 12:21
  • So your `shared_pointer` keeps its ref count in the `shared_pointer` object itself, not in a dynamically allocated control block that `shared_pointer` objects reference? – Peter Cordes Apr 27 '23 at 12:22
  • @PeterCordes Well, I used a simplification. A C++20 coroutine has a "promise" and a "return object" that function as a _Promise_ and _Future_ pair. My `promise_type` stores the atomic reference count of its coroutine as a member. The `promise_type` is suballocated within its coroutine's coroutine frame. The coroutines themselves are typically allocated on the stack, thanks to HALO. So, if I have a coroutine `A` that runs and immediately `co_await`s a coroutine `B`, then: the coroutine frame of `B`, the refcount to `B` and the return object's pointer to `B`'s refcount are all on `A`'s stack. – janekb04 Apr 27 '23 at 12:34
  • Ok, so there's probably no hope of having the object optimize away completely like `f1`, that's probably too much going on with passing a reference to it to something that looks like a function call. Perhaps what you actually need is a `shared_pointer` that's only safe for use within a single thread, since a coroutine runs as part of the same OS-level thread. That means your increment/decrement don't need to be atomic on a machine level, since it's basically cooperative multi-tasking; a coroutine switch won't happen in the middle of a `count++`, right? – Peter Cordes Apr 27 '23 at 12:42
  • @PeterCordes Well, it is true that a coroutine switch will not happen in the middle of `count++`. However, I can copy a coroutine's future and send it to a different thread. Then, two threads have futures referencing the same coroutine. Then, both threads can copy their futures again and that will be two concurrent calls to `count++`. – janekb04 Apr 27 '23 at 12:45
  • Ok yeah, then your use-case does rule out the idea of taking advantage of being a single thread. – Peter Cordes Apr 27 '23 at 12:46
  • @PeterCordes OOC, Intel optimized also `f2` (but not `f3`): https://godbolt.org/z/Yvq3TcqnY – Daniel Langr Apr 27 '23 at 13:23
  • @DanielLangr: Interesting. And that's ICX, LLVM-based, so maybe there's hope for clang. Their classic ICC (2021.7.1 https://godbolt.org/z/7xPzz3qhn) doesn't optimize any of them at all, and amusingly emits broken Intel-syntax asm (like `lock xaddl ecx, (rdx)` - I guess it only knows how to print AT&T syntax properly for its atomic RMWs.) – Peter Cordes Apr 27 '23 at 14:09

0 Answers