
Suppose I wanted to copy the contents of a device register into a variable that will be read by multiple threads. Is there a good general way of doing this? Here are two possible methods:

#include <atomic>

volatile int * const Device_reg_ptr = reinterpret_cast<int *>(0x666);

// This variable is read by multiple threads.
std::atomic<int> device_reg_copy;

// ...

// Method 1
const_cast<volatile std::atomic<int> &>(device_reg_copy)
  .store(*Device_reg_ptr, std::memory_order_relaxed);

// Method 2
device_reg_copy.store(*Device_reg_ptr, std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_release);

More generally, in the face of possible whole program optimization, how does one correctly control the latency of memory writes in one thread being visible in other threads?

EDIT: In your answer, please consider the following scenario:

  • The code is running on a CPU in an embedded system.
  • A single application is running on the CPU.
  • The application has far fewer threads than the CPU has processor cores.
  • Each core has a massive number of registers.
  • The application is small enough that whole program optimization is successfully used when building its executable.

How do we make sure that a store in one thread does not remain invisible to other threads indefinitely?

WaltK
  • use of `std::atomic` already addresses the visibility concerns (i.e. no data races or dirty reads). The memory order of load/store operations, however, can still be a concern, but that can easily be controlled by passing the appropriate `std::memory_order` to `std::atomic::load` and `std::atomic::store` respectively. No need for `volatile` or additional memory fences. – Sander De Dycker Feb 02 '17 at 14:14
  • Atomic loads can be optimized, so there is not a 1-1 correspondence between nominal loads and run-time execution of load instructions. Volatile guarantees this 1-1 correspondence. – WaltK Feb 02 '17 at 15:04
  • 1
  • Putting _volatile_ before the _*_ makes the register and not the pointer volatile, which is correct. – WaltK Feb 02 '17 at 15:05
  • @WaltK, right; I had it mixed up. Comment deleted. – davmac Feb 02 '17 at 15:50
  • @WaltK : such optimizations are only allowed if they don't violate the guarantees provided by `std::atomic` and the chosen `std::memory_order`. Regardless, [`volatile` is of no use for an `std::atomic`](http://stackoverflow.com/questions/8819095/concurrency-atomic-and-volatile-in-c11-memory-model) - instead use the appropriate `std::memory_order` for your requirements. – Sander De Dycker Feb 02 '17 at 15:59
  • that said - if you require more guarantees than those provided by `std::atomic`, and if none of the `std::memory_order`s fulfill your requirements, you'll probably have to look at other primitives (which will probably be platform specific). – Sander De Dycker Feb 02 '17 at 16:07

2 Answers


If you would like to update the value of `device_reg_copy` atomically, then `device_reg_copy.store(*Device_reg_ptr, std::memory_order_relaxed);` suffices.

There is no need to apply `volatile` to atomic variables.

A `std::memory_order_relaxed` store is supposed to incur the least amount of synchronization overhead. On x86 it is just a plain `mov` instruction.

However, if you would like to update it in such a way that the effects of any preceding stores become visible to other threads along with the new value of `device_reg_copy`, then use a `std::memory_order_release` store, i.e. `device_reg_copy.store(*Device_reg_ptr, std::memory_order_release);`. In that case the readers need to load `device_reg_copy` with `std::memory_order_acquire`. Again, on x86 a `std::memory_order_release` store is a plain `mov`.
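
A minimal sketch of that release/acquire pairing (the producer/consumer function names and the other_data variable are illustrative additions, not from the question):

#include <atomic>

volatile int * const Device_reg_ptr = reinterpret_cast<int *>(0x666);

std::atomic<int> device_reg_copy;
int other_data;  // plain, non-atomic data published alongside the copy

void producer() {
    other_data = 42;  // ordinary store, sequenced before the release store
    device_reg_copy.store(*Device_reg_ptr, std::memory_order_release);
}

int consumer() {
    int reg = device_reg_copy.load(std::memory_order_acquire);
    // If this acquire load observes the producer's release store, the
    // write to other_data is guaranteed to be visible here as well.
    return reg;
}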

Whereas if you use the most expensive `std::memory_order_seq_cst` store, it does insert a memory barrier for you on x86.

This is why they say that the x86 memory model is a bit too strong for C++11: a plain `mov` instruction already provides `std::memory_order_release` semantics for stores and `std::memory_order_acquire` semantics for loads. There is no truly relaxed store or load on x86.
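
For instance (a sketch assuming typical x86-64 code generation; the exact instruction selection varies by compiler and version):

#include <atomic>

std::atomic<int> x;

void store_relaxed(int v) { x.store(v, std::memory_order_relaxed); }  // mov
void store_release(int v) { x.store(v, std::memory_order_release); }  // mov
void store_seq_cst(int v) { x.store(v, std::memory_order_seq_cst); }  // xchg, or mov + mfence

int load_acquire() { return x.load(std::memory_order_acquire); }      // mov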

I cannot recommend the CPU Cache Flushing Fallacy article highly enough.

Maxim Egorushkin

The C++ standard is rather vague about making atomic stores visible to other threads:

29.3.12 Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.

That is as detailed as it gets; there is no definition of 'reasonable', and visibility does not have to be immediate.

Using a stand-alone fence to force a certain memory ordering is not necessary, since you can specify the ordering on the atomic operations themselves; the question is what you expect a memory fence to accomplish.
Fences are designed to enforce ordering on memory operations (between threads), but they do not guarantee visibility in a timely manner. You can store a value to an atomic variable with the strongest memory ordering (i.e. `seq_cst`), but even when another thread executes the `load()` at a later time than the `store()`, you might still get an old value from the cache, and yet (surprisingly) that does not violate the happens-before relationship. Using a stronger fence might make a difference with respect to timing and visibility, but there are no guarantees.
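
For reference, here is a sketch of correct stand-alone fence usage; note that the release fence has to come before the relaxed store (unlike Method 2 in the question, which places the fence after the store). The publish/observe function names are illustrative:

#include <atomic>

std::atomic<int> device_reg_copy;

void publish(int value) {
    // A release fence orders all preceding writes before any atomic
    // store sequenced after it...
    std::atomic_thread_fence(std::memory_order_release);
    device_reg_copy.store(value, std::memory_order_relaxed);
}

int observe() {
    int v = device_reg_copy.load(std::memory_order_relaxed);
    // ...and an acquire fence after the relaxed load completes the pairing.
    std::atomic_thread_fence(std::memory_order_acquire);
    return v;
}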

If prompt visibility is important, I would consider using a read-modify-write (RMW) operation to load the value. These are atomic operations that read and modify a variable in a single, indivisible step, and they have the additional property that they are guaranteed to operate on the latest value in the variable's modification order. But since they have to reach a little further than the local cache, these calls also tend to be more expensive to execute.

As pointed out by Maxim Egorushkin, whether or not you can use memory orderings weaker than the default (`seq_cst`) depends on whether other memory operations need to be synchronized (made visible) between threads. That is not clear from your question, but it is generally considered safe to use the default (sequential consistency).
If you are on an unusually weak platform, if performance is problematic, and if you need data synchronization between threads, you could consider using acquire/release semantics:

// thread 1
device_reg_copy.store(*Device_reg_ptr, std::memory_order_release);


// thread 2
device_reg_copy.fetch_add(0, std::memory_order_acquire);

If thread 2 sees the value written by thread 1, it is guaranteed that memory operations prior to the store in thread 1 are visible after the load in thread 2. Acquire/Release operations form a pair and they synchronize based on a run-time relationship between the store and load. In other words, if thread 2 does not see the value stored by thread 1, there are no ordering guarantees.

If the atomic variable has no dependencies on any other data, you can use `std::memory_order_relaxed`; the ordering of stores to a single atomic variable (its modification order) is always guaranteed.
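
For example (a sketch; the poll_device and read_copy function names are illustrative):

#include <atomic>

std::atomic<int> device_reg_copy;

void poll_device(volatile int *reg) {
    // Only device_reg_copy itself is shared; no other data is published,
    // so relaxed ordering suffices.
    device_reg_copy.store(*reg, std::memory_order_relaxed);
}

int read_copy() {
    // All threads observe the stores to this one variable in a single,
    // consistent modification order, even with relaxed loads.
    return device_reg_copy.load(std::memory_order_relaxed);
}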

As mentioned by others, there is no need for `volatile` when it comes to inter-thread communication with `std::atomic`.

LWimsey
  • Atomic RMWs aren't better for seeing a recent store. See the last section of [my answer on another Q](https://stackoverflow.com/a/71722126/224132). Atomic RMWs need exclusive ownership of the cache line (which slows down the reading thread), but other than that they don't have to "reach further". They keep exclusive ownership of the cache line for the duration of the RMW, so no other core can access it. Using an RMW slows down this thread so in that sense it's more likely to have seen a recent store, but it's still just down to hardware cache coherency whether you see a recent store or not – Peter Cordes Aug 25 '22 at 20:00
  • An RMW does stop out-of-order exec from taking a load value as early (in the out-of-order execution window), especially if this core doesn't already have exclusive ownership of the cache line. But do you really want to slow down the reader all the time for that? And slow down the writer, if it was doing anything else with other data in the same cache line. Serializing reads with writes may possibly be useful sometimes, but doesn't remove the possibility of the read not seeing a write that executes (but hasn't yet committed to L1d cache) a few nanoseconds earlier. – Peter Cordes Aug 25 '22 at 20:10
  • TL:DR: I wouldn't recommend an atomic RMW. It's not worth making the reader slower overall (especially on x86 where it's a full barrier), and causing extra cache coherency traffic. **An RMW slows down the hopefully-common case of the value you want to see already being ready, potentially by a lot, especially on x86.** Just for a relatively small gain in doing the load maybe a bit later relative to other operations in the same thread. – Peter Cordes Aug 25 '22 at 20:21
  • Now I'm curious how small the effect is, relative to typical inter-core latency of maybe 40ns (that's I think half the round-trip time for cores to bounce a cache line back and forth in a ping-pong test on a typical modern desktop.) I'd guess smaller than that. Hmm, on a CPU with an out-of-order exec window size of 300 instructions, running an average of 2 IPC, at 4GHz, that's 37.5 ns between the oldest and youngest instructions in the ROB. But if an RMW has to stall to wait for an RFO, that would reduce IPC. (So delaying the reader, like a `pause` or sleep or delay loop before a pure load – Peter Cordes Aug 25 '22 at 20:34
  • **If you have parallel readers, you definitely don't want RMWs** as that would serialize them with each other. – Peter Cordes Aug 25 '22 at 20:35