I have std::atomic<T> atomic_value; (for type T being bool, int32_t, int64_t, or any other type). If the 1st thread does

atomic_value.store(value, std::memory_order_relaxed);

and in the 2nd thread, at some points in the code, I do

auto value = atomic_value.load(std::memory_order_relaxed);

How fast is this updated atomic value propagated from the 1st thread to the 2nd, between CPU cores? (Across all CPU models.)

Is it propagated almost immediately? For example, at the speed of cache-coherence propagation on Intel, meaning 0-2 cycles or so, and maybe a few more cycles on some other CPU models/manufacturers?

Or may this value sometimes stay un-updated for many, many cycles?

Do atomics guarantee that the value is propagated between CPU cores as fast as possible for the given CPU?

Maybe if, instead, in the 1st thread I do

atomic_value.store(value, std::memory_order_release);

and in the 2nd thread

auto value = atomic_value.load(std::memory_order_acquire);

then will it help propagate the value faster? (Notice the change of both memory orders.) And does it now come with a speed guarantee? Or is it the same speed guarantee as for relaxed order?

As a side question: does replacing the relaxed order with release+acquire also synchronize all modifications of other (non-atomic) variables?

Meaning: is everything that the 1st thread wrote to memory before the store-with-release guaranteed to be visible in the 2nd thread, in exactly its final state (the same as in the 1st thread), at the point of the load-with-acquire - of course, only in the case that the loaded value was the new (updated) one?

So does this mean that for ANY type of std::atomic<> (or std::atomic_flag), a store-with-release in one thread synchronizes all memory writes before it with the point in another thread that does a load-with-acquire of the same atomic, provided of course that the other thread observed the updated value of the atomic? (If the value seen in the 2nd thread is not yet the new one, then of course we expect that the memory writes may not yet be visible.)
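For concreteness, here is a minimal sketch of the pattern I am asking about (the names data and flag are just illustrative):

#include <atomic>
#include <cassert>

int data = 0;                    // plain, non-atomic variable
std::atomic<bool> flag{false};

void thread1() {
    data = 42;                                    // non-atomic write
    flag.store(true, std::memory_order_release);  // publishes all earlier writes
}

void thread2() {
    if (flag.load(std::memory_order_acquire))     // observed the updated value?
        assert(data == 42);                       // then the earlier write must be visible
}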

PS. Why this question arose: based on the name "atomic", it is tempting to conclude (probably mis-conclude) that by default (without extra constraints, i.e. with just relaxed memory order) std::atomic<> only makes each operation atomic, and nothing else: no other guarantees about synchronization or speed of propagation. Meaning that a write to the memory location happens as a whole (e.g. all 4 bytes at once for int32_t), an exchange with the atomic location does both the read and the write atomically (actually in a locked fashion), and incrementing a value does the three steps read-add-write atomically.
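A small sketch of that "atomicity only" reading (counter is an illustrative name):

#include <atomic>
#include <cstdint>

std::atomic<int32_t> counter{0};

void worker() {
    // A single atomic read-modify-write: the read, the add, and the write
    // cannot be interleaved with another thread's RMW on the same object.
    // memory_order_relaxed drops only the ordering of surrounding memory
    // operations; the atomicity itself is unconditional.
    counter.fetch_add(1, std::memory_order_relaxed);
}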

Arty
  • Does this help? [Does hardware memory barrier make visibility of atomic operations faster...?](https://stackoverflow.com/q/61591287/2752075) TL;DR No, stronger memory orders don't increase the propagation speed, they just make the current thread wait more. *"does replacing relaxed order with release+acquire also synchronize all modifications"* Yes, that's the only difference between acquire/release and relaxed. I've attempted an explanation of different memory orders [here](https://stackoverflow.com/a/70585811/2752075). – HolyBlackCat May 08 '22 at 13:45
  • I highly recommend you watch both parts 1 & 2 of Herb's talks about `atomic`s: https://www.youtube.com/watch?v=A8eCGOqgvH4 and https://www.youtube.com/watch?v=KeLBd2EJLOU – WBuck May 08 '22 at 13:49
  • @HolyBlackCat Does it mean that I can just use relaxed memory order for both store/load, if I only want to send the atomic value itself from one thread to another, without any other synchronization of surrounding writes? Will this relaxed store/load send the value as fast as possible for the given CPU architecture? Meaning that any other order will not help at all in speeding up the sending of the single value itself. My only concern about the value being stuck is whether std::atomic<>, e.g. on Intel, guarantees that .store() will not keep the value in a register, but will definitely store it to the cache line? – Arty May 08 '22 at 15:21
  • I haven't tested this personally, but the linked Q&A implies that `relaxed` would be ok. – HolyBlackCat May 08 '22 at 15:25

1 Answer


The C++ standard says only this [C++20 intro.progress p18]:

An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time.

Technically this is only a "should", and "finite time" is not very specific. But the C++ standard is broad enough that you can't expect it to specify a particular number of cycles or nanoseconds or what have you.

In practice, you can expect that a call to any atomic store function, even with memory_order_relaxed, will cause an actual machine store instruction to be executed. The value will not just be left in a register. After that, it's out of the compiler's hands and up to the CPU.

(Technically, if you had two or more stores in succession to the same object, with a bounded amount of other work done in between, the compiler would be allowed to optimize out all but the last one, on the basis that you couldn't have been sure anyway that any given load would happen at the right instant to see one of the other values. In practice I don't believe that any compilers currently do so.)
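For illustration, a hypothetical case of the optimization described above (not something compilers are known to do today):

#include <atomic>

std::atomic<int> atomic_value{0};

void stores() {
    atomic_value.store(1, std::memory_order_relaxed);  // could legally be elided...
    // ...bounded amount of other work...
    atomic_value.store(2, std::memory_order_relaxed);  // ...leaving only this store
}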

A reasonable expectation for typical CPU architectures is that the store will become globally visible "without unnecessary delay". The store may go into the core's local store buffer. The core will process store buffer entries as quickly as it can; it does not just let them sit there to age like a fine wine. But they still could take a while. For instance, if the cache line is currently held exclusive by another core, you will have to wait until it is released before your store can be committed.

Using stronger memory ordering will not speed up the process; the machine is already making its best efforts to commit the store. Indeed, a stronger memory ordering may actually slow it down; if the store was made with release ordering, then it must wait for all earlier stores in the buffer to commit before it can itself be committed. On strongly-ordered architectures like x86, every store is automatically release, so the store buffer always remains in strict order; but on a weakly ordered machine, using relaxed ordering may allow your store to "jump the queue" and reach L1 cache sooner than would otherwise have been possible.
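As an illustration of that last point, here is the typical (not standard-mandated) code generation for stores at different orderings; exact instructions vary by compiler and target:

#include <atomic>

std::atomic<int> x{0};

void store_relaxed(int v) {
    x.store(v, std::memory_order_relaxed);  // x86-64: plain mov; AArch64: str
}

void store_release(int v) {
    x.store(v, std::memory_order_release);  // x86-64: still a plain mov (every x86 store is a release); AArch64: stlr
}

void store_seq_cst(int v) {
    x.store(v, std::memory_order_seq_cst);  // x86-64: typically xchg (full barrier; this thread waits); AArch64: stlr
}

Note that none of these variants make the store itself propagate faster; the stronger orderings only constrain what this thread may do around the store.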

Nate Eldredge
  • What about the .exchange(..., std::memory_order_relaxed) method of atomic? Does it guarantee that the method doesn't finish until ALL cores of the CPU see the updated (exchanged) value? So can it ever happen that the atomic holds value 0, then the 1st thread .exchange()s it to value 1 while at the same time the 2nd thread exchanges it for value 2 - can both the 1st and 2nd threads get the previous value of 0 returned from .exchange()? That would mean the 1st thread exchanged 0 to 1, the 1 hadn't propagated yet, the 2nd thread exchanged the same old 0 to 2, and both threads got 0 in return. – Arty May 08 '22 at 18:45
  • @Arty: Those are separate questions. None of these functions guarantee that they "don't finish" until all cores see the updated value; that could still be delayed. But they do guarantee the atomicity: for an atomic RMW, the value loaded and the value stored are consecutive in the modification order. In your example, it cannot ever happen that both threads return 0. – Nate Eldredge May 08 '22 at 18:50
  • @Arty: A possible implementation could look like this: thread 1 holds the cache line in [exclusive state](https://en.wikipedia.org/wiki/MESI_protocol), loads the value 0, executes a store of the value 1. The store may not commit instantly, and the core may go on executing other instructions, but the cache line will remain exclusive to this core until it does commit. No other core will be able to complete a load or store to that cache line until after the new value has become visible. – Nate Eldredge May 08 '22 at 18:53
  • @Arty: Generally, thinking about "time" or "immediacy" in the context of modern CPUs and memory models is not the way to go. Think instead about *ordering*. The exact time at which operations occur is not important so long as you know which ones will see the results of which others. Trying to assign a "time" to each operation is only possible if they are totally ordered, which is more or less to say, sequentially consistent. Modern memory models don't necessarily promise that this is the case. – Nate Eldredge May 08 '22 at 18:56
  • @Arty: FYI, for atomic RMW, the citation is [atomics.order p10], combined with the fact that the modification order is a total order [intro.races p4]. So in your example, between the store of 1 and the store of 2, one of them precedes the other in modification order. If the store of 1 comes first, then `.exchange(2)` must return 1, not 0. And a similar argument if the store of 2 comes first. – Nate Eldredge May 08 '22 at 20:26
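A runnable sketch of the guarantee discussed in these comments: with two concurrent exchanges on an atomic that starts at 0, at most one thread can get 0 back.

#include <atomic>
#include <cassert>
#include <thread>

int main() {
    std::atomic<int> v{0};
    int r1 = -1, r2 = -1;
    std::thread t1([&] { r1 = v.exchange(1, std::memory_order_relaxed); });
    std::thread t2([&] { r2 = v.exchange(2, std::memory_order_relaxed); });
    t1.join();
    t2.join();
    // The two exchanges are adjacent in the modification order: whichever
    // runs second must return the first one's value, never the original 0.
    assert(!(r1 == 0 && r2 == 0));
}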