
Consider this C++ statement:

foo.exchange(bar, std::memory_order_acq_rel);

Is the above statement exactly equivalent to any of the options below?

1)

foo.exchange(bar, std::memory_order_acquire);
dummy.store(0, std::memory_order_release);

2)

dummy.store(0, std::memory_order_release);
foo.exchange(bar, std::memory_order_acquire);

3)

foo.exchange(bar, std::memory_order_release);
dummy.load(std::memory_order_acquire);

4)

dummy.load(std::memory_order_acquire);
foo.exchange(bar, std::memory_order_release);

In case they are not equivalent, please mention why they are not.

Peter Cordes
Sourav Kannantha B
  • For almost all purposes, ops on local atomic variables can be treated as non atomic operations (say normal integer ops), dismissing all the MT stuff. (Almost, as atomic ops can count against "infinite loops".) – curiousguy Mar 08 '23 at 03:57

3 Answers


For 1) and 2) no, some other thread that loads foo won't sync-with foo.exchange(acquire) in another thread, because it's only an acquire, not a release operation. So that other thread won't safely be able to read the values of non-atomic assignments from before the exchange, or get guaranteed values for earlier atomic stores.

Options 3) and 4) have various problems in terms of (not) syncing with another writer or reader to create a happens-before relationship. That only happens when one thread does an acquire-load of the value from a release-store in another thread. If the store side of the exchange is relaxed, that doesn't happen.
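
For a concrete picture, here's a minimal message-passing sketch (the names payload, writer, and reader are mine, just for illustration):

#include <atomic>

std::atomic<int> foo{0};
int payload;   // non-atomic data to publish

void writer() {
    payload = 42;
    // Publishing payload needs a release store side. Options 3)/4) get that
    // from exchange(release); options 1)/2) only have an acquire exchange,
    // i.e. a relaxed store side, so the reader could see foo == 1 but a
    // stale payload.
    foo.exchange(1, std::memory_order_release);
}

void reader() {
    while (foo.load(std::memory_order_acquire) != 1) {}  // pairs with the release
    int x = payload;   // safe: release/acquire gives happens-before
}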

IDK if you're thinking of dummy.store(0, std::memory_order_release); as being a 2-way barrier like atomic_thread_fence(release), but it's not; it's just a release operation on a dummy variable that no other thread ever accesses (I assume.)

See https://preshing.com/20120913/acquire-and-release-semantics/ for a description in terms of local reordering of accesses to coherent shared memory. Acquire and release operations can reorder in one direction each. The dummy release store can reorder with any later operations except ones that are themselves release or stronger, so it might as well not exist.
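
Sketched as comments on the two forms (paraphrasing the reordering rules, not normative wording):

dummy.store(0, std::memory_order_release);            // release *operation*: earlier
                                                      // ops can't sink below it, but
                                                      // later ops can float above it
std::atomic_thread_fence(std::memory_order_release);  // release *fence*: also stops
                                                      // later *stores* from floating
                                                      // above earlier ops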

What would be approximately equivalent (strictly stronger I think) is:

  // Any earlier operations can't reorder past the fence
std::atomic_thread_fence(std::memory_order_release);
  // and later stores can't reorder before the fence
foo.exchange(bar, std::memory_order_acquire);  // so this store is after any earlier ops

The load part of the exchange can still reorder with earlier loads/stores on other objects so it's not much stronger. (related: For purposes of ordering, is atomic read-modify-write one operation or two?)


Also fine would be foo.exchange(bar, release) ; thread_fence(acquire).
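
Spelled out, under the same caveats as the fence version above:

foo.exchange(bar, std::memory_order_release);         // store side is a release operation
std::atomic_thread_fence(std::memory_order_acquire);  // the load part of the exchange
  // can't reorder past the fence, so later operations stay after it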

Another answer suggests foo.exchange(bar, release) ; foo.load(acquire) would be equivalent, but it's not. The acquire load might sync-with a different thread than the one whose value the exchange saw.

If you're really not using the return value of exchange to either check if you should do something (if(sequence_num > x)), or figure out what or where you should access (e.g. a pointer or array index), its acquire semantics are unlikely to matter at all.

But if we consider a reader like int idx = foo.exchange(bar, acq_rel); int tmp = arr[idx];, replacing the acq_rel exchange with int idx = foo.exchange(bar, release) ; foo.load(acquire) (ignoring the value of that acquire load) wouldn't be equivalent. Only an acquire barrier (fence) would order the load side of the exchange wrt. later operations.

If a store from a third thread becomes visible between the exchange(release) and load(acquire), you don't sync-with the thread that stored the value your exchange saw, only the third thread that stored the value you're ignoring.

Consider a writer that did arr[i] = 123; foo.store(i, release);
If a third thread did foo.store(0, relaxed); or whatever, the foo.load(acquire) would sync with it, not the one that wrote arr[idx]. This is of course a contrived example, and dependency ordering would save you on real CPUs even though the load side of foo.exchange was relaxed not consume. But ISO C++ formally guarantees nothing in that case. (And branching on the exchange result instead of using it as part of a load or maybe store address wouldn't let dependency ordering save you.)
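
Here's a sketch of that contrived case (the thread names, array size, and bar's value are invented for illustration):

#include <atomic>

std::atomic<int> foo{0};
int arr[64];
int bar = -1;

void writer() {                          // thread 1: publishes arr[1] via foo
    arr[1] = 123;
    foo.store(1, std::memory_order_release);
}

void third() {                           // thread 3: pure store, not an RMW
    foo.store(0, std::memory_order_relaxed);
}

void reader() {                          // thread 2: the proposed replacement
    int idx = foo.exchange(bar, std::memory_order_release);  // load side is relaxed
    foo.load(std::memory_order_acquire); // may read third()'s 0, syncing with
                                         // thread 3 instead of with writer()
    if (idx == 1) {
        int tmp = arr[idx];  // ISO C++: data race, no happens-before with writer()
        (void)tmp;
    }
}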

If the third thread was also using exchange (even relaxed), that would create a release-sequence so your load would still sync-with the earlier writer as well. But a pure store doesn't guarantee that, breaking a release-sequence.
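
That is, changing the third thread in the sketch above to an RMW would be enough:

foo.exchange(0, std::memory_order_relaxed);  // an RMW, even relaxed, continues the
                                             // release sequence headed by writer()'s
                                             // release store, so a later acquire load
                                             // of this value still syncs-with writer()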

On most CPUs, where stores can only become visible to other threads by committing to coherent cache, the writer had to wait for exclusive ownership of the cache line just like for an atomic RMW. So plain stores can also continue a release-sequence, letting an acquire load sync-with all previous release stores and RMWs to the object. But ISO C++ doesn't formally guarantee that, and I wouldn't bet on it being safe on PowerPC where store-forwarding between logical cores is a thing. Except that on PPC, an acquire load is done with asm barriers, which would also strengthen the load part of an exchange.

Still, if you're trying to understand the C++ formalism, it's important to understand that the load whose value you actually use needs to be acquire, or there needs to be an acquire fence (not just an acquire operation).

Sourav Kannantha B
Peter Cordes
  • The last example you gave, can it be changed to a release exchange followed by an acquire fence? – Sourav Kannantha B Mar 07 '23 at 13:09
  • @SouravKannanthaB: Yes, that would also be fine. – Peter Cordes Mar 07 '23 at 19:15
  • @SouravKannanthaB: BTW, I added a section about the other answer's suggestion that a dummy `foo.load(acquire)` after the exchange could be equivalent. It's not: if a store becomes visible between the exchange and load, you don't sync-with the thread that stored the value your exchange saw. – Peter Cordes Mar 07 '23 at 19:59
  • By dependency ordering, did you mean ordering which preserves Store->Store? – Sourav Kannantha B Mar 08 '23 at 10:32
  • @SouravKannanthaB: No, I meant what `memory_order_consume` gives you. [C++11: the difference between memory\_order\_relaxed and memory\_order\_consume](https://stackoverflow.com/a/59832012) / [What does memory\_order\_consume really do?](https://stackoverflow.com/q/65336409) / [Memory order consume usage in C11](https://stackoverflow.com/q/55741148) are some Q&As where I wrote about it. See also [Paul E. McKenney's CppCon 2016 talk about its deprecation in C++, and Linux RCU](https://www.youtube.com/watch?v=ZrNQKpOypqU&index=44&list=PLHTh1InhhwT75gykhs7pqcR_uSiG601oh) – Peter Cordes Mar 08 '23 at 17:53

Although the C++ memory model does not describe acquire/release semantics in terms of reordering, it's still a pretty good approximation. Acquire operations can be reordered with earlier operations, but not with later; release is the other way around.

It can be helpful visually to try it with cards on a table or something like that. Each card is a load/store/RMW operation, and you start with them in program order. Then the rule is that you may swap any two adjacent cards unless the left one is acquire, or the right one is release, or both.
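
The swap rule is mechanical enough to write down as a predicate; here is a sketch (the Card type is invented for illustration):

#include <string>

struct Card {             // one load/store/RMW operation
    std::string name;
    bool acquire;         // has acquire semantics (acquire, acq_rel, seq_cst)
    bool release;         // has release semantics (release, acq_rel, seq_cst)
};

// Two adjacent cards may be swapped unless the left one is acquire,
// or the right one is release, or both.
bool can_swap(const Card& left, const Card& right) {
    return !left.acquire && !right.release;
}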

In what's below, let X be your foo.exchange, which we will decorate as XA or XR according to whether it is acquire or release. Let DA/DR be the dummy acquire-load or release-store. Let P be any relaxed or non-atomic operation that is sequenced before both X and D, and Q another one that is sequenced after.

In the original version, we begin with simply P XAR Q. Since X is both acquire and release, it cannot be swapped with either P or Q. (It is possible for either P or Q to be reordered between the load and store within X, but that's not really relevant here.) So if in some replacement code there is any way to move either P or Q to the opposite side of X, then it is not equivalent to the original.

In #1 it is easy. You start with P XA DR Q, but P and XA can be immediately swapped because XA is only acquire.

In #2 it takes a little more. You start with P DR XA Q, and you cannot swap P with DR, nor XA with Q. But you can swap DR with XA, and then P with XA.

P DR XA Q
P XA DR Q
XA P DR Q

I leave #3 and #4 as exercises, as they have similar solutions.

Nate Eldredge

The operations are completely different for a simple reason. A release operation on variable a is not equivalent in any way to a release operation on variable b. To synchronize with the releasing thread, one would need to perform an acquire on variable b rather than a. That's the difference. Yes, the memory operations are tied to variables.

So replacing the acq_rel exchange with a weaker operation on foo plus an operation on dummy will not properly synchronize with threads that perform acquire or release operations on foo, depending on which part was weakened on foo.
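
A sketch of the mismatch (data and the thread bodies are invented for illustration):

#include <atomic>

std::atomic<int> foo{0}, dummy{0};
int data;

void thread_a() {
    data = 1;
    foo.store(1, std::memory_order_release);  // release on foo
}

void thread_b() {
    dummy.load(std::memory_order_acquire);    // acquire on dummy: can never
                                              // synchronize with the release on foo
    if (foo.load(std::memory_order_relaxed) == 1) {
        int x = data;                         // data race: no happens-before edge
        (void)x;
    }
}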

However, if you performed a discarded load on foo, with the complementing memory order, in addition to the exchange, the effect would be pretty much equivalent. Alternatively, you could use a general fence, which triggers a stronger synchronization instruction.
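
In code, the discarded-load suggestion looks roughly like this (but, as Peter Cordes's answer and the comments below point out, it isn't strictly equivalent per ISO C++):

foo.exchange(bar, std::memory_order_release);  // store side covered by the release
foo.load(std::memory_order_acquire);           // discarded load supplies the acquire
                                               // side; may sync-with a different
                                               // thread than the exchange saw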

ALX23z
  • *Albeit if you called a discarded load on foo in addition to the exchange* - I thought about mentioning that in my answer, but the load could see the value of a different store than the exchange did. So you could sync-with a different thread than the one you need to sync with to safely use the return value from the `exchange`. – Peter Cordes Mar 07 '23 at 09:36
  • (That's fine on real CPUs, at least ones without stuff like PowerPC's private store forwarding between logical cores; Writers had to get exclusive access to a cache line before they could commit a store, so a plain store is pretty much like an RMW in terms of creating a release-sequence. But ISO C++ doesn't guarantee that, and I suspect PowerPC can break that assumption.) – Peter Cordes Mar 07 '23 at 09:37
  • The code in the question doesn't save the `exchange` return value, so it can't ever know what the acquire part synced with. If it did sync with anything, there's a happens before anyway, but IDK if you can write code that depends on that without checking or using the value you loaded. So you could maybe argue that the acquire part never mattered if you don't save the result :P – Peter Cordes Mar 07 '23 at 09:39
  • @PeterCordes the thing is, they are not 100% equivalent technically. Separate `acquire` and `release` allow more configurations to happen. But I don't see any working routine that would fail by separating the two operations. – ALX23z Mar 07 '23 at 09:44
  • Consider a writer that did `arr[i] = 123;` / `foo.store(i, release);`. Then the reader does `int idx = foo.exchange(bar, release);` / `foo.load(acquire)`, then uses the index to read the "payload" with `arr[idx]`. If a third thread did `foo.store(0, relaxed)` or whatever, the `foo.load(acquire)` would sync with it, not the one that wrote `arr[idx]`. This is of course a contrived example, and dependency ordering would save you on real CPUs even though the load side of `foo.exchange` was relaxed not `consume`. But ISO C++ formally guarantees nothing in that case. – Peter Cordes Mar 07 '23 at 09:53
  • @PeterCordes hmm... I suppose one can write some bizarre code that can fail when separating the two. Though, honestly, I don't believe I saw any production or prototype code that needed anything beyond relaxed, release, and acquire. – ALX23z Mar 07 '23 at 10:06
  • @PeterCordes Which modification you are synchronizing (the acquire load whose value you ignored) with after the measurement (a relaxed load whose value you tested) cannot make a difference if your program has only RMW acq+rel, or if it only has RMW that extend the release-sequence of the modification you measured. Otherwise it could make a difference. – curiousguy Mar 08 '23 at 03:46
  • @curiousguy: Yup, that's more or less what I said in the recent edit to my answer to say what I pointed out in comments, since it's something that almost but not quite works, and will in practice "work" on some CPUs even though ISO C++ doesn't guarantee it. – Peter Cordes Mar 08 '23 at 03:48