
To start with, consider release semantics. Suppose a data set is protected with a spinlock (mutex, etc.; it doesn't matter which exact implementation is used; for now, assume 0 means it's free and 1 means it's busy). After changing the data set, a thread stores 0 to the spinlock's address. To force visibility of all previous actions before the 0 is stored, the store is executed with release semantics, meaning that all previous reads and writes shall be made visible to other threads before this store. Whether this is done with a full barrier or with a release mark on the single store operation is an implementation detail. That is (I hope) clear beyond any doubt.
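
For example, a release-store unlock might look like this (a minimal sketch; the names are illustrative):

  #include <atomic>

  std::atomic<int> spinlock{0};   // 0 = free, 1 = busy (illustrative)

  void unlock() {
    // All earlier reads and writes to the protected data must become visible
    // to other threads no later than this store that marks the lock as free.
    spinlock.store(0, std::memory_order_release);
  }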

Then, consider the moment when spinlock ownership is taken. To protect against races, this is some kind of compare-and-set operation. With a single-instruction CAS implementation (x86, SPARC...), it is a combined read and write; the same goes for the x86 atomic XCHG. With LL/SC (most RISCs), it falls into:

  1. Read (LL) the spinlock location until it shows the free state. (Can be optimized with a kind of CPU stall.)
  2. Write (SC) the value "occupied" (1 in our case). The CPU exposes whether the operation was successful (condition flag, output register, etc.).
  3. Check the SC result and, if it failed, go to step 1.

In all cases, the operation that must become visible to other threads to show that the spinlock is occupied is the write of 1 to its location, and a barrier must be enforced between this write and the following manipulations of the data set protected by the spinlock. The read of the spinlock contributes nothing to the protection scheme, except that it permits the CAS or LL/SC operation to proceed.

But all schemes actually implemented allow the acquire-semantics modifier on reads (or on a CAS), not on writes. As a result, the LL/SC scheme would seem to require an additional final read-with-acquire operation on the spinlock to enforce the required barrier. But there is no such instruction in typical compiler output. For example, compiling the following for ARM:

  // p is a std::atomic<int>* pointing at the spinlock word (0 = free, 1 = busy)
  for (;;) {
    int e{0};
    int d{1};
    if (std::atomic_compare_exchange_weak_explicit(p, &e, d,
          std::memory_order_acquire,
          std::memory_order_relaxed)) {
      return;  // lock taken
    }
  }

The output contains first LDAXR == LL+acquire, then STXR == SC (with no barrier in it, so is there no guarantee other threads will see it?). This is likely not an artifact of my setup; the same pattern appears e.g. in glibc: pthread_spin_trylock calls only __atomic_compare_exchange_weak_acquire (and no further barriers), which falls into the GCC builtin __atomic_compare_exchange_n with acquire on the mutex read and no release on the mutex write.
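
For reference, here is roughly what that glibc path boils down to, as a sketch of my understanding (not the actual glibc source; the EBUSY return value is just illustrative):

  #include <cerrno>   // for EBUSY

  // Acquire on the CAS success path; no release anywhere on the lock path.
  int spin_trylock(int *lock) {
    int expected = 0;
    return __atomic_compare_exchange_n(lock, &expected, 1, /* weak */ true,
                                       __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)
           ? 0 : EBUSY;
  }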

It seems I've missed some fundamental detail in this reasoning. Could anybody correct it?

This could also be split into 2 sub-questions:

SQ1: In instruction sequence like:

(1) load_linked+acquire mutex_address     ; found it is free
(2) store_conditional mutex_address       ; succeeded
(3) read or write of mutex-protected area

what prevents the CPU from reordering (2) and (3), with the result that other threads won't see the mutex is locked?

SQ2: Is there a design factor that suggests having acquire semantics only on loads?

I've also seen examples of lock-free code such as:

thread 1:

var = value;
flag.store(true, std::memory_order_release);

thread 2:

if (flag.load(std::memory_order_acquire)) {
   // Now we can safely access it
   value = var;
   ... do something with value ...
}

but this style could only have been made to work once the mutex-protected style works reliably.

Netch
  • @PeterCordes This is the same lock-free case I've added, but it doesn't explain how a mutex protects data if its successful store isn't protected against reordering with later reads and writes. – Netch Oct 14 '19 at 05:49
  • @PeterCordes I've checked this example against GCC (5.5) for AArch64. In its output, there is no barrier at all after LDAXR. https://pastebin.com/QiPhyqyw – Netch Oct 14 '19 at 12:56
  • Of course there isn't; you didn't ask for one. The question is whether that's strong enough for taking a mutex. (Although if you had used seq_cst it still wouldn't need a barrier: AArch64's release-store instruction is a sequential-release.) – Peter Cordes Oct 14 '19 at 13:03
  • In glibc, `pthread_spin_trylock` calls only `__atomic_compare_exchange_weak_acquire`, which falls into the GCC builtin `__atomic_compare_exchange_n` with acquire on the mutex read and no release on the mutex write. So isn't it strong enough for locking a mutex? – Netch Oct 14 '19 at 13:56

2 Answers


Its output contains first LDAXR == LL+acquire, then STXR == SC
(without barrier in it, so, there is no guarantee other threads will see it?)

Huh? Stores always become visible to other threads; the store buffer always drains itself as fast as possible. The question is only whether to block later loads/stores in this thread until the store buffer is empty. (That's required for seq-cst pure stores, for example).

STXR is exclusive and tied to the LL. So it and the load are indivisible in the global order of operations, as the load and store side of an atomic RMW operation, just like x86 does in one instruction with lock cmpxchg. (This is actually an over-statement: For purposes of ordering, is atomic read-modify-write one operation or two? - you can observe some effects of the store side of an atomic RMW reordering with later operations even when the load can't. I'm not sure I fully understand how this is still safe, but it is.)

The atomic RMW can move earlier (because acquire loads can do that, and so can relaxed stores). But it can't move later (because acquire-loads can't do that). Therefore the atomic RMW appears in the global order before any operations in the critical section, and is sufficient for taking a lock. It doesn't have to wait for earlier operations like cache-miss stores; it can let them move into the critical section. But that's not a problem.
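
A sketch of what that means for a spinlock built around an acquire CAS (illustrative names, not code from the question):

  #include <atomic>

  std::atomic<int> lock{0};
  int shared_data = 0;              // protected by 'lock'

  void locked_update() {
    int expected = 0;
    while (!lock.compare_exchange_weak(expected, 1,
                                       std::memory_order_acquire,
                                       std::memory_order_relaxed)) {
      expected = 0;                 // CAS overwrites 'expected' on failure
    }
    // The acquire RMW may complete before *earlier* unrelated stores drain,
    // but nothing below can appear to happen before the lock is taken.
    ++shared_data;
    lock.store(0, std::memory_order_release);   // unlock
  }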

However, if you had used an acq_rel CAS, it couldn't take the lock until after finishing all earlier loads/stores (because of the release semantics of the store side).

I'm not sure if there's any asm difference between acq_rel and seq_cst for an atomic RMW. IIRC, yes on PowerPC. Not on x86, where all RMWs are seq_cst. AArch64 ARMv8.3 has ldapr, which allows acq_rel without seq_cst. Before that, there was only ldar / stlr / stlxr: only relaxed and sequential-release/acquire.


LDAR + STR would be like x86 cmpxchg without a lock prefix: an acquire load and a separate store. (Except that the store side of x86 cmpxchg is still a release-store, though not a sequential-release, because of the x86 memory model.)


Other confirmation of my reasoning that mo_acquire for the "success" side of a CAS is sufficient for taking a lock:

  • https://en.cppreference.com/w/cpp/atomic/memory_order says "The lock() operation on a Mutex is also an acquire operation"
  • Glibc's pthread_spin_trylock uses the GCC builtin __atomic_compare_exchange_n on the mutex with only acquire, not acq_rel or seq_cst. We know many smart people have looked at glibc. And on platforms where it's not effectively strengthened to seq-cst asm, bugs probably would have been noticed if there were any.

what prevents the CPU from reordering (2) and (3), with the result that other threads won't see the mutex is locked?

That would require other threads to see the LL and SC as separate operations, not as an atomic RMW. The whole point of LL/SC is to prevent that. Weaker ordering lets the pair move around as a unit, but not split apart.

SQ2: Is there a design factor that suggests having acquire semantics only on loads?

Yes, consider pure loads and pure stores, not RMWs. Jeff Preshing on acq and rel semantics.

The one-way barrier of a release-store naturally works well with the store buffer on real CPUs. CPUs "want" to load early and store late. Perhaps Jeff Preshing's article Memory Barriers Are Like Source Control Operations is a helpful analogy for how CPUs interact with coherent cache. (Also related: this answer mentions that along with why compile-time reordering can help compilers optimize)

A store that could only appear earlier, not later, would basically require flushing the store buffer, i.e. a relaxed store followed by a full barrier (like atomic_thread_fence(seq_cst), e.g. ARM dsb ish, x86 mfence, or a locked operation). This is what you get from a seq-cst store. So we more or less already have a name for it, and it's very expensive.
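
For example (a sketch; the exact instructions depend on the target and compiler):

  #include <atomic>

  std::atomic<bool> flag{false};

  void release_publish() {
    // One-way barrier only: on x86 this is just a plain MOV store.
    flag.store(true, std::memory_order_release);
  }

  void seq_cst_publish() {
    // Later loads must wait: on x86 this compiles to XCHG (or MOV + MFENCE),
    // which effectively waits for the store buffer to drain.
    flag.store(true, std::memory_order_seq_cst);
  }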

Peter Cordes
  • "STXR is exclusive and tied to the LL. So it and the load are indivisible in the global order of operations," - hmmm, this seems to be the core point. If some special rule extends acquire semantics of LL operation to its respective succeeded SC, this explains the things. I'll explore related rules in ARM documentation and return to the question after this. – Netch Oct 14 '19 at 17:02
  • @Netch: I'm not sure where to find it in the manuals, but yes, LL+SC form an *atomic* (from the Greek meaning indivisible) operation from the POV of other threads / cores. And yes, it's very much the core point. The LL/SC pair is inherently tied together: wherever the load goes, so goes the store. Anything else would, as you point out, make `acquire` insufficient for taking locks. Since we know it *is* sufficient, I'm pretty confident. (I used to wonder about this problem for LL/SC and what it meant for an atomic RMW to be only acq or only rel, but this is the understanding I reached.) – Peter Cordes Oct 14 '19 at 17:06
  • I've put my own answer to the question - would you please recheck it? In short, it suggests no atomic tie is needed at all, thanks to the explicit check of the SC result. – Netch Feb 21 '20 at 09:58
  • @Netch: That's another way of explaining the same thing I was: assuming STXR succeeds, that means the LDAXR and STXR together appear as an atomic operation to other threads. (If it fails, it's an aborted transaction and only the load happened at all, and we aren't in the critical section. If we're talking about reordering with the critical section, that implies that we reached it at all via STXR succeeding, so I was only considering that case. Yes, of course we need a branch to check the result; that's the `if` around `atomic_compare_exchange_weak_explicit` = LDAXR/CBNZ/STXR) – Peter Cordes Feb 21 '20 at 10:15
  • @Netch: *it suggests no atomic tie is needed at all*. No, LDAXR and STXR need to interact with each other to appear (to other threads) as a single atomic RMW operation. The setting and checking of the "monitor" in ARM terminology on the relevant cache line is what ties them together in the global order of operations. If not for that, multiple threads could enter the critical section, and the relaxed stores could reorder with the relaxed store in the critical section. I think my mental model and terminology differed from yours enough that my attempt to explain didn't make sense to you :/ – Peter Cordes Feb 21 '20 at 10:21
  • Thanks for the confirmation. On RMW atomicity, it seems I was inexact because I didn't imply it at all, concentrating on the "trick" for plain spinlock acquisition. But it could be useful for other, less obvious, use cases. To be considered some more. – Netch Feb 21 '20 at 11:11
  • @Netch: I'm not sure which "trick" you're referring to. The asm in your answer is what you'd expect from a compiler for `while(!lock.compare_exchange_weak(expected=0, 1, std::memory_order_acquire) {}` before a critical section. CAS_weak is allowed to fail either because of SC failing or because of the compare part failing; we don't care which so there's only the one loop around LL/SC. Anyway, I think you understand how it works now, and probably not too much to be gained from talking about exactly what you misunderstood earlier. :) – Peter Cordes Feb 21 '20 at 11:28
  • @Netch: Correction to my previous comments, an LL/SC isn't 100% tied together to appear adjacent in a global order of accesses. An SC-release can reorder after other later non-release stores, and loads (at least to other cache lines). It does still maintain the illusion of the RMW on that cache line being atomic. [For purposes of ordering, is atomic read-modify-write one operation or two?](https://stackoverflow.com/q/65568185) and [ARM STLR memory ordering semantics](https://stackoverflow.com/q/65466840). There are still limits to the ordering, though, for the SC to be able to succeed. – Peter Cordes Mar 16 '21 at 00:22

I have got an answer from another source that I consider conclusive; here is my translation and rewording of it.

The principle that disallows the problematic reordering is not some kind of implicit memory barrier - it might not be implemented at all, and the operation would still be correct - but the fact that spinlock acquisition is checked and, unless it succeeds, the thread shall not continue with data access. The AArch64 example code (from the same answerer) is:

; Spinlock Acquire
    PRFM PSTL1KEEP, [X1] ; preload into cache in unique state
Loop
    LDAXR W5, [X1] ; read lock with acquire semantics
    CBNZ W5, Loop ; retry if the lock was not 0 (i.e. busy)
    STXR W5, W0, [X1] ; attempt to store the "locked" value
    CBNZ W5, Loop ; test if the store succeeded and retry if not
; loads and stores in the critical region can now be performed
    STR X25, [X10] ; example access to the protected data
; Spinlock Release
    STLR WZR, [X1] ; clear the lock with release semantics

STXR itself could have been reordered with other, later accesses, but the CBNZ that follows it will not allow the subsequent instructions to commit unless the STXR succeeds. (The CPU may, in general, execute some of them speculatively if it predicts that would be useful, but it shall not commit their results unless execution unambiguously reaches them.)

This looks obvious once explained, but it wasn't so before; my bad, it seems :(

(The answerer suggested reading section K11 of the ARM® Architecture Reference Manual (ARMv8) for more details.)

However, this doesn't in any way refute the need to present the LL/SC pair atomically to other participants, where that is required - that is a nearly orthogonal question.

Netch