There's nothing you can do to make a store visible to other threads any sooner. See If I don't use fences, how long could it take a core to see another core's writes? - barriers don't speed up visibility to other cores, they just make this core wait until that's happened.
The store part of an RMW is not different from a pure store for this, either.
(Certainly on x86; not totally sure about other ISAs, where a relaxed LL/SC might possibly get special treatment from the store buffer, possibly being more likely to commit before other stores if this core can get exclusive ownership of the cache line. But I think it still would have to retire from out-of-order execution so the core knows it's not speculative.)
Anthony's answer that was linked in comment is misleading; as I commented there:
If the RMW runs before the other thread's store commits to cache, it doesn't see the value, just like if it was a pure load. Does that mean "stale"? No, it just means that the store hasn't happened yet.
The only reason RMWs need a guarantee about "latest" value is that they're inherently serializing operations on that memory location. This is what you need if you want 100 unsynchronized fetch_add
operations to not step on each other and be equivalent to += 100, but otherwise best-effort / latest-available value is fine, and that's what you get from a normal atomic load.
If you require instant visibility of results (a nanosecond or so), that's only possible within a single thread, like x = y; x += z;
Also note, the C / C++ standard requirement (actually just a note) to make stores visible in a reasonable amount of time is in addition to the requirements on ordering of operations. It doesn't mean seq_cst store visibility can be delayed until after later loads. All seq_cst operations happen in some interleaving of program order across all threads.
On real-world C++ implementations, the visibility time is entirely up to hardware inter-core latency. But the C++ standard is abstract, and could in theory be implemented on a CPU that required manual flushing to make stores visible to other threads. Then it would be up to the compiler to not be lazy and defer that for "too long".
volatile atomic<T>
is useless; compilers already don't optimize atomic<T>
, so every atomic
access done by the abstract machine will already happen in the asm. (Why don't compilers merge redundant std::atomic writes?). That's all that volatile does, so volatile atomic<T>
compiles to the same asm as atomic<T>
for anything you can with the atomic.
Defining "stale" is a problem because separate threads running on separate cores can't see each other's actions instantly. It takes tens of nanoseconds on modern hardware to see a store from another thread.
But you can't read "stale" values from cache; that's impossible because real CPUs have coherent caches. (That's why volatile int
could be used to roll your own atomics before C++11, but is no longer useful.) You may need an ordering stronger than relaxed
to get the semantics you want as far as one value being older than another (i.e. "reordering", not "stale"). But for a single value, if you don't see a store, that means your load executed before the other core took exclusive ownership of the cache line in order to commit its store. i.e. that the store hasn't truly happened yet.
In the formal ISO C++ rules, there are guarantees about what value you're allowed to see which effectively give you the guarantees you'd expect from cache coherency for a single object, like that after a reader sees a store, further loads in this thread won't see some older store and then eventually back to the newest store. (https://eel.is/c++draft/intro.multithread#intro.races-19).
(Note for 2 writers + 2 readers with non-seq_cst operations, it's possible for the readers to disagree about the order in which the stores happened. This is called IRIW reordering, but most hardware can't do it; only some PowerPC. Will two atomic writes to different locations in different threads always be seen in the same order by other threads? - so it's not always quite as simple as "the store hasn't happened yet", it can be visible to some threads before others. But it's still true that you can't speed up visibility, only for example slow down the readers so none of them see it via the "early" mechanism, i.e. with hwsync
for the PowerPC loads to drain the store buffer first.)