
For fun I'm writing my own threading library, used by me and a friend or two. The first thing I'd like to write is a mutex.

It appears I'm generating the assembly I want: `__atomic_fetch_add` seems to generate `lock xadd`, and `__atomic_exchange` seems to generate `xchg` (not `cmpxchg`). I use both with `__ATOMIC_SEQ_CST` (I'll stick to that for now).

If I am using `__ATOMIC_SEQ_CST`, will gcc or clang understand these are synchronizing functions? If I write `lock(); global++; unlock();`, will any compiler move `global++;` before or after the lock/unlock functions? Do I need to call `__atomic_thread_fence(__ATOMIC_SEQ_CST);` or `__sync_synchronize();` for any reason? (They seem to do the same thing on x86-64.) https://gcc.gnu.org/wiki/Atomic/GCCMM/AtomicSync seems to suggest my understanding is correct, but it's easy to misread documentation, and I sometimes wonder if scope plays a part in these rules.

Am I right in thinking that, using those intrinsics, code in between a lock/unlock will behave the same way as it does with a pthread mutex lock/unlock?
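
For reference, here's roughly the shape of what I mean, stripped down to a pure spin lock on the `xchg` path (no wakeups yet; the type and function names are just placeholders):

```c
typedef struct { int locked; } my_mutex;   /* 0 = free, 1 = held */

static void lock(my_mutex *m)
{
    /* __atomic_exchange_n is the xchg I mentioned: spin until we are the
       thread that flipped 0 -> 1. */
    while (__atomic_exchange_n(&m->locked, 1, __ATOMIC_SEQ_CST) != 0)
        ;   /* busy-wait */
}

static void unlock(my_mutex *m)
{
    /* A plain atomic store is enough to release; no RMW needed here. */
    __atomic_store_n(&m->locked, 0, __ATOMIC_SEQ_CST);
}
```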

  • Even `__ATOMIC_ACQUIRE` or RELEASE as the memory-order for an atomic RMW would give sufficient ordering wrt. other operations, including non-atomic ones. That's the whole point of memory orders other than `__ATOMIC_RELAXED`. And why ISO C++ defines taking a `std::mutex` as an "acquire" operation on the mutex object. If you're looking at the asm, you know every x86 atomic RMW is already a full memory barrier, right? – Peter Cordes May 04 '22 at 06:18
  • Any particular reason you're using gcc's builtin atomics, instead of the C++11 `std::atomic`, which is much more standardized and widely understood? – Nate Eldredge May 04 '22 at 06:31
  • @PeterCordes Oh? Let's say I'm in unlock: if I do an acquire on an int, would I not get myself into trouble because the global variable didn't flush to other caches? In my __atomic_fetch_add/`lock xadd` case, would I want at least memory_order_acq_rel? I reread https://en.cppreference.com/w/cpp/atomic/memory_order and it looks like that is the case. It seems like relaxed and consume are easy to get wrong since it's not easy to see the dependency chain? Maybe relaxed and consume would work for a lockless 1:1 channel – Eric Stotch May 04 '22 at 06:32
  • @NateEldredge I've been writing libraries from scratch in the past year often in assembly to get a better understanding how everything works – Eric Stotch May 04 '22 at 06:34
  • If you're going to use a simple counter for a mutex, incrementing with `__atomic_fetch_add` to take it, then `acquire` is sufficient. It's loading the value 0 that proves the mutex belongs to you; the critical section can be reordered up to that point but not earlier. But `acq_rel` doesn't add any value; making the store half of the `fetch_add` be release doesn't accomplish anything, and moreover doesn't stop the critical section floating above the store. Likewise your `__atomic_fetch_sub` to drop the mutex need only be release, not `acq_rel`. – Nate Eldredge May 04 '22 at 06:39
  • Unlock has to be a RELEASE operation, which x86 can actually do with a pure store. (Taking a lock still has to be an atomic RMW, which means x86 asm has to do it with effectively seq_cst semantics. But acquire would allow compile-time reordering.) See [Locks around memory manipulation via inline assembly](https://stackoverflow.com/a/37246263) / [What is the minimum X86 assembly needed for a spinlock](https://stackoverflow.com/q/22943572) – Peter Cordes May 04 '22 at 06:47
  • @NateEldredge: You kind of have that backwards. Alpha *doesn't* support `mo_consume`, and must strengthen it to `acquire`. All other mainstream platforms do dependency ordering in hardware (or like x86 give you full acquire for free). So there is actually something to be gained from `consume`, like `int *ptr = shared.load(consume)` / `tmp = *ptr` doesn't need any barriers even on ARM or PowerPC, but would for `acquire`. (Of course, `consume` is temporarily deprecated because it's too hard for compilers; they promote to `acquire` and use a barrier. Linux RCU hand-rolls consume data deps.) – Peter Cordes May 04 '22 at 06:51
  • @NateEldredge Hmm... That sounds correct. I never understood the up/down (or drop) explanation. I understand __atomic_fetch_add will get me 0 if I got the mutex and > 0 if it was contended. But on unlock I heard I was supposed to set it to a negative number and wake another thread up, and I didn't understand why. Is that any better than, in unlock, loading (or xchg'ing) the counter, seeing if it's == 1 (uncontended), and executing wakeup code if it was not? – Eric Stotch May 04 '22 at 06:51

1 Answer


> If I am using `__ATOMIC_SEQ_CST`, will gcc or clang understand these are synchronizing functions?

Yes, that is the entire reason for these primitives to have memory ordering semantics.

The memory ordering semantics accomplish two things: (1) ensure that the compiler emits instructions that include the appropriate barriers to prevent reordering by the CPU; (2) ensure that compiler optimizations do not move code past those instructions in ways that the barrier should forbid.
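
As a small illustration (the variable names here are hypothetical): on x86-64 both calls below will typically compile to the same `lock xadd` instruction, which is already a full barrier in hardware, but only the second one also forbids the compiler from moving the surrounding code across it.

```c
int counter;   /* hypothetical shared counter */
int data;      /* hypothetical plain, non-atomic data */

void publish_relaxed(void)
{
    data = 42;
    /* Still an atomic `lock xadd` on x86-64, but the compiler is free to
       sink the `data = 42` store below it. */
    __atomic_fetch_add(&counter, 1, __ATOMIC_RELAXED);
}

void publish_seq_cst(void)
{
    data = 42;
    /* Same instruction on x86-64, but now the compiler must keep the
       `data = 42` store above it, and weaker ISAs get real barriers too. */
    __atomic_fetch_add(&counter, 1, __ATOMIC_SEQ_CST);
}
```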

> If I write `lock(); global++; unlock();`, will any compiler move `global++;` before or after the lock/unlock functions?

They will not. Again, that is the whole point of being able to specify memory ordering.
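
Sketched out with the minimal orderings suggested in the comments (acquire to take the lock, release to drop it; the names here are placeholders, not part of any standard API):

```c
static int lk;   /* 0 = unlocked, 1 = held */
int global;

static void lock(void)
{
    /* Acquire: nothing in the critical section can be hoisted above this
       atomic RMW, by the compiler or by the CPU. */
    while (__atomic_exchange_n(&lk, 1, __ATOMIC_ACQUIRE) != 0)
        ;   /* spin */
}

static void unlock(void)
{
    /* Release: nothing in the critical section can sink below this store. */
    __atomic_store_n(&lk, 0, __ATOMIC_RELEASE);
}

void increment(void)
{
    lock();
    global++;   /* must stay between lock() and unlock() */
    unlock();
}
```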

> Do I need to call `__atomic_thread_fence(__ATOMIC_SEQ_CST);` or `__sync_synchronize();` for any reason?

Not in this setting, no.

Nate Eldredge