
Suppose I have a thread A that writes to an atomic_int x = 0; using x.store(1, std::memory_order_relaxed);. Without any other synchronization, how long would it take before other threads can see this using x.load(std::memory_order_relaxed);? Is it possible that the value written to x stays entirely thread-local, given the current definition of the C/C++ memory model in the standard?

The practical case that I have at hand is a thread B that frequently reads an atomic_bool to check whether it has to quit; another thread, at some point, writes true to this bool and then calls join() on thread B. Clearly I do not mind calling join() before thread B can even see that the atomic_bool was set, nor do I mind when thread B already saw the change and exited before I call join(). But I am wondering: using memory_order_relaxed on both sides, is it possible that I call join() and block "forever" because the change is never propagated to thread B?

Edit

I contacted Mark Batty (the brain behind mathematically verifying, and subsequently fixing, the C++ memory model requirements). Originally about something else (which turned out to be a known bug in cppmem and his thesis, so fortunately I didn't make a complete fool of myself), and I took the opportunity to ask him about this too; his answer was:

Q: Can it theoretically be that such a store [memory_order_relaxed without (any following) release operation] never reaches the other thread?
Mark: Theoretically, yes, but I don't think that has been observed.
Q: In other words, do relaxed stores make no sense whatsoever unless you combine them with some release operation (and acquire on the other thread), assuming you want another thread to see it?
Mark: Nearly all of the use cases for them do use release and acquire, yes.

Jonas
Carlo Wood
  • The Edit is more like an answer; but since it isn't my answer I decided to add it as edit rather than as answer. I hope some might find the opinion of this expert useful. – Carlo Wood May 04 '17 at 01:47
  • Is the question specifically about C++11? – curiousguy Dec 14 '19 at 08:27
  • It is about the C++ memory model that was introduced in C++11. In practice any write to memory is going to be visible to all other threads within a few microseconds, and probably much faster, even if you don't include assembly instructions that flush the cache to memory. Most notably, on Intel there isn't a difference at all between a relaxed store and a release store (with regard to assembly and hardware - compiler reordering not included in this remark). – Carlo Wood Dec 19 '19 at 02:06
  • Which implementations generate instructions that "flush the cache to memory"? In which cases? – curiousguy Dec 19 '19 at 03:58
  • Nothing as far as I know. It wouldn't make sense. All you can do is add a memory fence (or any other 'memory_order_release' operation) which would at least assure that everything gets flushed to memory before subsequent writes to memory will be. – Carlo Wood Dec 20 '19 at 16:32

3 Answers


This is all the standard has to say on the matter, I believe:

[intro.multithread]/25 An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time.

rustyx
Igor Tandetnik
  • And in practice hardware that std::thread starts threads on has coherent caches, not requiring software flushing, so visibility time = time for the store buffer to commit your store. When that happens, other cores will see a MESI invalidate/RFO from the storing thread, then have to do a share request themselves to get a copy of the new value. See [When to use volatile with multi threading?](//stackoverflow.com/a/58535118) for more details about the fact that ISO C++ is written to run on cache-coherent hardware, and running without that is barely plausible. – Peter Cordes Dec 13 '19 at 08:03
  • My answer on [Why set the stop flag using \`memory\_order\_seq\_cst\`, if you check it with \`memory\_order\_relaxed\`?](https://stackoverflow.com/a/70593598) also quotes *33.5.4 Order and consistency [atomics.order]* - *11. Implementations should make atomic stores visible to atomic loads **within a reasonable amount of time**.* So that's two *should* requirements, one with "finite period" and one with "reasonable amount of time". The standard leaves it as basically a quality-of-implementation factor; real hardware is what gives us low latency. – Peter Cordes Nov 24 '22 at 10:58

In practice

Without any other synchronization methods, how long would it take before other threads can see this, using x.load(std::memory_order_relaxed);?

No time. It's a normal write: it goes to the store buffer, so it will be available in the L1d cache in less time than a blink. But that only happens once the assembly instruction is actually run.

Instructions can be reordered by the compiler, but no reasonable compiler would reorder an atomic operation across arbitrarily long loops.

In theory

Q: Can it theoretically be that such a store [memory_order_relaxed without (any following) release operation] never reaches the other thread?

Mark: Theoretically, yes,

You should have asked him what would happen if the "following release fence" was added back, or with an atomic release store operation.

Why wouldn't these be reordered and delayed a loooong time? (so long that it seems like an eternity in practice)

Is it possible that the value written to x stays entirely thread-local given the current definition of the C/C++ memory model that the standard gives?

If an imaginary and especially perverse implementation wanted to delay the visibility of an atomic operation, why would it do that only for relaxed operations? It could just as well do it for all atomic operations.

Or never run some threads.

Or run some threads so slowly that you would believe they aren't running.

curiousguy

This is what the standard says in 29.3.12:

Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.

There is no guarantee that a store will become visible in another thread, no guaranteed timing, and no formal relationship with memory ordering.

Of course, on every mainstream architecture a store will become visible, but on rare platforms that do not support cache coherency it may never become visible to a load.
In that case, you would have to reach for an atomic read-modify-write operation to get the latest value in the modification order.

LWimsey
  • Are you sure this is said about std::memory_order_relaxed (too)? I can imagine that this remark is necessary even for just release/acquire stores/reads, because in that case we know about ordering, but still nothing is said about timing; i.e., put two single-core PCs next to each other and they will obey the standard if it wasn't for this remark ;). – Carlo Wood May 04 '17 at 01:32
  • @CarloWood Absolutely.. It's a common misconception that memory ordering is related to visibility of the `atomic` variable itself; It is not.. (what would be the use of relaxed atomics if they never became visible to other cores?). `acquire/release` semantics specify ordering (and thus visibility) of other memory operations with respect to an `atomic` operation. If an `atomic` variable does not become visible in another thread, neither do the memory operations it orders. – LWimsey May 04 '17 at 01:51
  • "_but on rare platforms_" Could you give examples? – curiousguy Dec 12 '18 at 00:53
  • @curiousguy I cannot give you an example, but cache-coherency is an optional feature. You might find a non-cache-coherent architecture in the embedded world. – LWimsey Dec 12 '18 at 02:15
  • Doesn't this rule guarantee a store will become visible in another thread, since a reasonable amount of time certainly excludes infinite time? – xskxzr Feb 16 '19 at 06:04
  • If `std::thread` starts your threads across non-cache-coherent cores, and `std::atomic` doesn't manually flush that line for relaxed, or everything for release, then your C++ implementation is almost certainly broken. Remember that for each atomic variable separately, a single modification order (that *all* threads can agree on) must exist, even with mo_relaxed stores. (This doesn't include observers that see their own store-forwarding early, though). I don't think letting an atomic stay thread-local could be considered valid, at least not by the spirit of the standard. – Peter Cordes Dec 13 '19 at 09:15
  • And yes there are embedded CPUs with both a microcontroller and DSP on chip that aren't cache-coherent with each other, but `std::thread` won't start threads across cores on both. See [When to use volatile with multi threading?](//stackoverflow.com/a/58535118) - e.g. "*This architecture (ARMv7) is written with an expectation that all processors using the same operating system or hypervisor are in the same Inner Shareable shareability domain*". If you take pointers to shared non-coherent memory and cast them to `std::atomic*` without also using manual flushing, UB is your own fault. – Peter Cordes Dec 13 '19 at 09:19