
Here we have three similar functions. They all always return 0.

This first function is optimized well. The compiler sees that x is always 1.

int f1()
{
    std::atomic<int> x = 1;
    if (x.load() == 1)
    {
        return 0;
    }
    return 1;
}
f1():                                 # @f1()
        xor     eax, eax
        ret

This second function is more interesting. The compiler realized that subtracting 0 from x is a no-op, so the subtraction itself was optimized away. However, it does not notice that x is always 1 and still performs a comparison. Also, despite the use of std::memory_order_relaxed, an mfence instruction was emitted.

int f2()
{
    std::atomic<int> x = 1;
    if (x.fetch_sub(0, std::memory_order_relaxed) == 1)
    {
        return 0;
    }
    return 1;
}
f2():                                 # @f2()
        mov     dword ptr [rsp - 4], 1
        mfence
        mov     ecx, dword ptr [rsp - 4]
        xor     eax, eax
        cmp     ecx, 1
        setne   al
        ret
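As an aside, if the read-modify-write itself is not needed, replacing the no-op fetch_sub(0) with a plain relaxed load expresses the same read without RMW semantics, and clang should then be able to fold the function the same way as f1. (This variant is my addition for comparison, not part of the original snippets.)

```cpp
#include <atomic>

// Hypothetical variant of f2: a plain relaxed load instead of the
// no-op fetch_sub(0). A load carries no RMW semantics, so after
// escape analysis the compiler is free to fold it to the constant 1.
int f2_load()
{
    std::atomic<int> x{1};
    if (x.load(std::memory_order_relaxed) == 1)
    {
        return 0;
    }
    return 1;
}
```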

Finally, this is the real example of what I'm trying to optimize. What I'm really doing is implementing a simple shared_pointer, and x represents the reference count. I'd like the compiler to optimize away the needless atomic operation in the case where a temporary shared_pointer object is created and immediately destroyed.

int f3()
{
    std::atomic<int> x = 1;
    if (x.fetch_sub(1, std::memory_order_relaxed) == 1)
    {
        return 0;
    }
    return 1;
}
f3():                                 # @f3()
        mov     dword ptr [rsp - 4], 1
        xor     eax, eax
        lock            dec     dword ptr [rsp - 4]
        setne   al
        ret
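For context, the release path of such a shared_pointer maps directly onto f3. A minimal sketch with hypothetical names (not my actual implementation, which lives inside a coroutine promise):

```cpp
#include <atomic>

// Sketch: the owner whose decrement takes the count from 1 to 0 is
// responsible for destroying the managed object.
struct ref_counted
{
    std::atomic<int> refs{1};
};

// Returns true if the caller just dropped the last reference.
bool release(ref_counted& rc)
{
    // Same pattern as f3: relaxed decrement, compare old value to 1.
    return rc.refs.fetch_sub(1, std::memory_order_relaxed) == 1;
}
```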

I am using clang. How can I make it optimize f3 like it did with f1?
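One workaround I could imagine (a sketch only; I have not verified how clang compiles it, and it relies on the invariant that the caller itself holds a reference, so observing a count of 1 means exclusive ownership) is to try a plain load before falling back to the RMW:

```cpp
#include <atomic>

// Hypothetical variant of f3: check with a plain load first. If this
// thread holds a reference and observes a count of 1, it holds the
// only reference, so the atomic RMW can be skipped entirely. Since
// the load is not an RMW, the compiler has more freedom to fold it.
// (A real shared-pointer release would typically want acquire/release
// ordering here rather than relaxed.)
int f3_load_first()
{
    std::atomic<int> x{1};
    if (x.load(std::memory_order_relaxed) == 1)
    {
        return 0; // sole owner: no atomic decrement needed
    }
    if (x.fetch_sub(1, std::memory_order_relaxed) == 1)
    {
        return 0; // this call removed the last reference
    }
    return 1;
}
```

Whether clang actually folds this down to `xor eax, eax; ret` like f1 would have to be checked; the point is only that the common path no longer contains an RMW.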

janekb04
  • The use of stack based (local to a function) atomics is probably quite rare. What is your real use-case? – Richard Critten Apr 27 '23 at 12:19
  • Basically a duplicate of [Why don't compilers merge redundant std::atomic writes?](https://stackoverflow.com/q/45960387) which discusses the fact that compilers don't optimize atomics in general. I was surprised clang managed to optimize away even `f1`, I guess due to escape analysis and lack of any atomic RMWs. Also [Why do relaxed atomic operations prevent compiler optimizations?](https://stackoverflow.com/q/70578990) – Peter Cordes Apr 27 '23 at 12:21
  • @RichardCritten This is my real use-case. The `atomic` I am showing here is a member of my `shared_pointer`. However, here I show it inlined. This shared pointer is actually a shared coroutine that I use in my [job system](https://github.com/janekb04/job_system/tree/main). – janekb04 Apr 27 '23 at 12:21
  • So your `shared_pointer` keeps its ref count in the `shared_pointer` object itself, not in a dynamically allocated control block that `shared_pointer` objects reference? – Peter Cordes Apr 27 '23 at 12:22
  • @PeterCordes Well, I used a simplification. A C++20 coroutine has a "promise" and a "return object" that function as a _Promise_ and _Future_ pair. My `promise_type` stores the atomic reference count of its coroutine as a member. The `promise_type` is suballocated within its coroutine's coroutine frame. The coroutines themselves are typically allocated on the stack, thanks to HALO. So, if I have a coroutine `A` that runs and immediately `co_await`s a coroutine `B`, then: the coroutine frame of `B`, the refcount to `B` and the return object's pointer to `B`'s refcount are all on `A`'s stack. – janekb04 Apr 27 '23 at 12:34
  • Ok, so there's probably no hope of having the object optimize away completely like `f1`, that's probably too much going on with passing a reference to it to something that looks like a function call. Perhaps what you actually need is a `shared_pointer` that's only safe for use within a single thread, since a coroutine runs as part of the same OS-level thread. That means your increment/decrement don't need to be atomic on a machine level, since it's basically cooperative multi-tasking; a coroutine switch won't happen in the middle of a `count++`, right? – Peter Cordes Apr 27 '23 at 12:42
  • @PeterCordes Well, it is true that a coroutine switch will not happen in the middle of `count++`. However, I can copy a coroutine's future and send it to a different thread. Then, two threads have futures referencing the same coroutine. Then, both threads can copy their futures again and that will be two concurrent calls to `count++`. – janekb04 Apr 27 '23 at 12:45
  • Ok yeah, then your use-case does rule out the idea of taking advantage of being a single thread. – Peter Cordes Apr 27 '23 at 12:46
  • @PeterCordes OOC, Intel optimized also `f2` (but not `f3`): https://godbolt.org/z/Yvq3TcqnY – Daniel Langr Apr 27 '23 at 13:23
  • @DanielLangr: Interesting. And that's ICX, LLVM-based, so maybe there's hope for clang. Their classic ICC (2021.7.1 https://godbolt.org/z/7xPzz3qhn) doesn't optimize any of them at all, and amusingly emits broken Intel-syntax asm (like `lock xaddl ecx, (rdx)` - I guess it only knows how to print AT&T syntax properly for its atomic RMWs.) – Peter Cordes Apr 27 '23 at 14:09

0 Answers