
I have a situation where I would like to prepare some data in one thread and consume it in another:

// My boolean flag
std::atomic<bool> is_data_ready = false;

Thread 1 (producer thread):
  PrepareData();
  if (!is_data_ready.exchange(true, std::memory_order_release)) {
    NotifyConsumerThread();
  }
  else {
    return;
  }

In consumer thread,

Thread 2:
  if (is_data_ready.exchange(false, std::memory_order_acquire)) {
    ProcessData();
  }

Does it make sense to use acquire/release order (instead of acq_rel order) for exchange? I am not sure if I understand it correctly: does std::memory_order_release in exchange mean the store is a release store? If so, what is the memory order for the load?

Peter Cordes
user1101010
  • Have you checked out https://en.cppreference.com/w/cpp/atomic/memory_order ? – Taekahn Sep 17 '22 at 22:07
  • It's really difficult doing lockless concurrent programming. And even if you get it just right, it will be difficult to maintain - especially for the next person who inherits this code. Just use a `std::mutex` – selbie Sep 17 '22 at 22:13
  • @Taekahn Yeah, thanks. I think I took everything too literally, which got me confused. Release order is for a store. Exchange is a load-store, and I feel the store should be a release for the way I used it. But I want to get confirmation. – user1101010 Sep 17 '22 at 22:14
  • My understanding is that since you're performing exchanges on both sides instead of individual store/load operations, it doesn't matter which ordering you pick. As an aside, it looks like you're using the atomic as a notification. Have you considered a semaphore? If you can't do C++20, there is also abseil Notification. As a final resort you can fire off a notification with C++11 futures. There is also probably something in boost. What I'm saying in a roundabout way is that I think there are better ways to handle producer/consumer synchronization. – Taekahn Sep 17 '22 at 22:52
  • @Taekahn Thanks for the suggestion. I took out details of my situation. The boolean flag is much needed. As for the atomic exchange itself, do you mean it's sequential consistent anyway? – user1101010 Sep 17 '22 at 23:27

1 Answer


An atomic RMW has a load part and a store part. memory_order_release gives the store side release semantics while leaving the load side relaxed; memory_order_acquire is the reverse, giving the load side acquire semantics and leaving the store relaxed. With exchange(val, acq_rel) or seq_cst, the load is an acquire load and the store is a release store.

(compare_exchange_weak/_strong can have one memory order for the pure-load case where the compare failed, and a separate memory order for the RMW case where it succeeds. This distinction is meaningful on some ISAs, but not on ones like x86 where it's just a single instruction that effectively always stores, even in the false case.)

And of course atomicity of the exchange (or any other RMW) is guaranteed regardless of anything else; no stores or RMWs to this object by other cores can come between the load and store parts of the exchange. Notice that I didn't mention pure loads, or operations on other objects. See later in this answer and also For purposes of ordering, is atomic read-modify-write one operation or two?


Yes, this looks sensible, although simplistic and maybe racy in allowing more data to be published after the first batch is consumed (or has started to be consumed)¹. But for the purposes of understanding how an atomic RMW works, and the ordering of its load and store sides, we can ignore that.

exchange(true, release) "publishes" some shared data stored by PrepareData(), and checks the old value to see if the worker thread needs to get notified.

And in the reader, is_data_ready.exchange(false, acquire) is a load that syncs with the release store if there was one, creating a happens-before relationship that makes it safe to read that data without data-race UB. And tied to that (as part of the atomic RMW), it lets other threads see that the consumer has gone past the point of checking for new work, so it needs another notify if more data arrives.


Yes, exchange(value, release) means the store part of the RMW has release ordering wrt. other operations in the same thread. The load part is relaxed, but the load/store pair still form an atomic RMW. So the load can't take a value until this core has exclusive ownership of the cache line.

Or in C++ terms, it sees the "latest value" in the modification order of is_data_ready; if some other thread was also storing to is_data_ready, that store will happen either before the load (before the whole exchange), or after the store (after the whole exchange).

Note that a pure load in another core coming after the load part of this exchange is indistinguishable from coming before, so only operations that involve a store are part of the modification order of an object. (That modification order is guaranteed to exist such that all threads can agree on it, even when you're using relaxed loads/stores.)

But the load part of another atomic RMW will have to come before the load part of the exchange, otherwise that other RMW would have this exchange happening between its load and its store. That would violate the atomicity guarantee of the other RMW, so that can't happen. Atomic RMWs on the same object effectively serialize across threads. That's why a million fetch_add(1, mo_relaxed) operations on an atomic counter will increment it by 1 million, regardless of what order they end up running in. (See also C++: std::memory_order in std::atomic_flag::test_and_set to do some work only once by a set of threads re: why atomic RMWs have to work this way.)

C++ is specified in terms of syncs-with and whether a happens-before guarantee exists that allows your other loads to see other stores by other threads. But humans often like to think in terms of local reordering (within execution of one thread) of operations that access shared memory (via coherent cache).

In terms of a memory-reordering model, the store part of an exchange(val, release) can reorder with later operations other than release or seq_cst. (Note that unlocking a mutex counts as a release operation). But not with any earlier operations. This is what acquire and release semantics are all about, as Jeff Preshing explains: https://preshing.com/20120913/acquire-and-release-semantics/.

Wherever the store ends up, the load is at some point before it. Right before it in the modification order of is_data_ready, but operations on other objects by this thread (especially in other cache lines) may be able to happen in between the load and store parts of an atomic exchange.

In practice, some CPU architectures don't make that possible. Notably, x86 atomic RMW operations are always full barriers: they wait for all earlier loads and stores to complete before the exchange, and don't start any later loads or stores until after it. So not even StoreLoad reordering of the store part of an exchange with later loads is possible on x86.

But on AArch64 you can observe StoreLoad reordering of the store part of a seq_cst exchange with a later relaxed load. But only the store part, not the load part; being seq_cst means the load part of the exchange has acquire semantics and thus happens before any later loads. See For purposes of ordering, is atomic read-modify-write one operation or two?


Footnote 1: is this a usable producer/consumer sync algorithm?

With a single boolean flag (not a queue with a read-index / write-index), IDK how a producer would know when it can overwrite the shared variables that the consumer will look at. If it (or another producer thread) did that right away after seeing is_data_ready == false, you'd race with the reader that's just started reading.

If you can solve that problem, this does appear to avoid the possibility of the consumer missing an update and going to sleep, as long as it handles the case where a second writer adds more data and sends a notify before the consumer finishes ProcessData. (The writers only know that the consumer has started, not when it finishes.) I guess this example isn't showing the notification mechanism, which might itself create synchronization.

If two producers run PrepareData() at overlapping times, the first one to finish will send a notification, not both; unless the consumer does an exchange and resets is_data_ready between the two exchanges in the producers, in which case it will get a second notification. (So that sounds pretty hard to deal with in the consumer, and in whatever data structure PrepareData() manages, unless it's something like a lock-free queue itself, in which case just check the queue for work instead of using this mechanism. But again, this is still a usable example to talk about how exchange works.)

If a consumer is frequently checking and finding no work to do, that's also extra contention that could have been avoided by checking read-only until it sees a true, and only then exchanging it to false (with an acquire exchange). But since you're worrying about notifications, I assume it's not a spin-wait loop, instead sleeping if there isn't work to do.

Peter Cordes
  • Thanks for this amazing answer. For the footnote question, my data has a simple structure and is atomic by itself. Also, the producers are free to overwrite shared data as long as the most recent update gets handled by the consumer. Redundant notifications to the consumer thread are also acceptable on my platform. `ProcessData()` also maintains previously processed data state and will not process if the same data is published. – user1101010 Sep 18 '22 at 18:38
  • @user1101010: Ok, so the shared data is a single lock-free atomic object? Or you update atomically using something like RCU or a SeqLock, so the reader can still read or copy a consistent snapshot. I guess you'd still need that separate boolean to avoid some redundant notifications when you have updates faster than the consumer reads them (I think that's the point of this?), unless you wanted to use the low bit of a SeqLock's sequence counter for that. Anyway, glad it helped. – Peter Cordes Sep 18 '22 at 19:17
  • Yes, the shared data is updated atomically. And you're right on the point of this arrangement as well: producers can update faster than a single consumer can process. But it's fine if the consumer misses most of the updates as long as it does not miss the most recent one. – user1101010 Sep 19 '22 at 13:52