4

For which (if any) STORE_ORDER and LOAD_ORDER does C++11 guarantee that this code runs in finite time?

std::atomic<bool> a{false};
std::thread t{[&]{
  while(!a.load(LOAD_ORDER));
}};
a.store(true, STORE_ORDER);
t.join();

I see two issues with this:

Memory order

It seems to me that with release & acquire, the compiler and CPU are allowed to reorder my join() (assuming it behaves like a load) before the store, which would of course break this.

Even with memory_order_seq_cst, I'm not sure if such a reordering is prohibited because I don't know if join() actually does any loads or stores.

Visibility

If I understood this question about memory_order_relaxed correctly, it is not guaranteed that a store with memory_order_relaxed becomes visible to other threads in a finite amount of time. Is there such a guarantee for any of the other orderings?

I understand that std::atomic is about atomicity and memory ordering, not about visibility. But I am not aware of any other tools in C++11 that could help me here. Would I need to use a platform-specific tool to get a correctness guarantee here, and if yes, which one?


To take this one step further – if I have finiteness, it would be nice to also have some promise about speed. I don't think the C++ standard makes any such promises. But is there any compiler or x86-specific way to get a promise that the store will become visible to the other thread quickly?


In summary: I'm looking for a way to swiftly stop a worker thread that is actually guaranteed to have this property. Ideally this would be platform-independent. But if we can't have that, does it at least exist for x86?

curiousguy
Chronial
  • `join` is an implicit memory barrier operation... – Chris Dodd Feb 15 '19 at 16:04
  • You have only one variable. Any load/store order will work. – rustyx Feb 15 '19 at 16:17
  • @ChrisDodd Does it say that in the standard? I couldn't find such a note. – Chronial Feb 15 '19 at 16:33
  • @rustyx what does "will work" mean in the context of this question and why would that be case? – Chronial Feb 15 '19 at 16:35
  • Per 30.1.3.5, a (successful) join synchronizes with the thread. Since it doesn't refer to a specific operation within the thread, that means it synchronizes with ALL operations in the thread -- in essence a full memory barrier. – Chris Dodd Feb 15 '19 at 16:41
  • @ChrisDodd My understanding is that synchronizes-with is a non-reflexive relationship and goes in the wrong direction in this case. It synchronizes all operations in `t` with the `join`. This does still allow moving the store after the join, does it not? – Chronial Feb 15 '19 at 16:57
  • The store can never "move after" the join as there's a sequence point between. So it will always be before the join in the thread that's doing it. The only reordering question is if some other thread might see the join before the store. That may be so, but doesn't matter -- even if the other thread 'sees' the join before the store, it won't complete until it sees the store, and that will eventually happen. – Chris Dodd Feb 15 '19 at 17:59
  • One way of thinking about C++ memory model is like this: code executes in program order, but effects get observed by other threads in unspecified order at unspecified time (relaxed model) or "better" (stricter models). Which means that if your thread is currently executing `join()` -- it is guaranteed that other threads will *eventually* observe the effect of your `store()` (regardless of memory model you used). Simply because your `store()` happens before `join()` in your program. – C.M. Feb 15 '19 at 19:28
  • @ChrisDodd: Exactly that is the core of my question. I know that if the worker stopped, it saw the store. But where do you get "...and that will eventually happen" from? I couldn't find anything that would guarantee that. – Chronial Feb 15 '19 at 19:36
  • @C.M. that statement is clearly not true in general: https://godbolt.org/z/xYxqbq. Thus, there need to be some guarantees on `store` and `join`. For relaxed ordering, there are [no guarantees](https://en.cppreference.com/w/cpp/atomic/memory_order#Relaxed_ordering) on `store`. For `join`: the completion of the worker thread synchronizes-with the `join`, which to my understanding only makes guarantees about things that happen in the worker before its end and in the main thread after the join. – Chronial Feb 15 '19 at 19:59
  • @Chronial I don't think you understood that statement ;) – C.M. Feb 15 '19 at 21:29
  • @Chronial: Your godbolt example illustrates this exactly -- *because* there is nothing in the loop that synchronizes with any other thread (such as a `join` does), the compiler can elide the x = 1 assignment. If there were any such statement, the compiler would not be able to and would have to emit the assignment. – Chris Dodd Feb 15 '19 at 23:01

2 Answers

2

After some more searching, I found a question that is identical to the visibility part of mine, which got a clear answer: there is indeed no such guarantee – there is only the request that "implementations should make atomic stores visible to atomic loads within a reasonable amount of time". The standard does not define what it means by "should", but I will assume the normal meaning, so this would be non-binding. It is also not quite clear what "reasonable" means, but I would assume it clearly excludes "infinite".

This doesn't quite answer the question about memory ordering. But if the store were ordered after the join(), which may block forever, the store would never become visible to the other thread – which would not be a "reasonable amount of time".

So while the standard does not require the code in the question to be valid, it at least suggests that it should be valid. As a bonus, it actually says that it shouldn't just be finite time, but also somewhat fast (or well, reasonable).

That leaves the part of my question about a platform-specific solution: Is there an x86-specific way to write the requested algorithm so it is actually guaranteed to be correct?

xskxzr
Chronial
  • about "x86-specific way" -- I believe you don't need to do anything for this to work on this platform. x86 is very "sequentially consistent". You just need some sort of compiler-level fence (to prevent `store` from being moved after `join()`). No need even for a `LOCK` prefix. Your store can be delayed in the CPU's write buffer (I think there was a way to force-flush it), but once it has left the write buffer -- all other CPUs will observe it "immediately". But why would you even think about all of this, if you could do it in the framework of the C++11 memory model and get (practically) the same result? – C.M. Feb 16 '19 at 23:48
  • @C.M. How can anything be stuck in a write buffer? – curiousguy Oct 30 '19 at 22:02
  • @curiousguy as far as I remember IA-32 manuals, store buffer content isn't observable by other CPUs. I.e. your `write` won't be observed by others until store buffer is flushed – C.M. Oct 31 '19 at 17:25
  • @C.M. Yes but the time passed in the buffer must be minuscule, right? – curiousguy Nov 01 '19 at 02:09
  • @curiousguy very likely, but not zero. Also, it leads to writes being "combined" or even reordered (from other CPU perspective) -- don't trust me on this, though. I read those docs like 15 years ago. Afair, there is retirement unit (RU) that makes sure modifications are observed in program order -- I have no recollection how it interacts with store buffer :). And in any case -- if you simply stay within C++11 memory model -- all of it wouldn't matter. – C.M. Nov 01 '19 at 14:16
1

Is there an x86-specific way to write the requested algorithm so it is actually guaranteed to be correct?

Use a compiler that isn't broken to make sure that thread.join() is correctly treated as a "black box" function which might read or write any memory.

Thus the compiler has to make sure that memory is "in sync" with the C++ abstract machine before blocking on the thread finishing. That is, reordering the store at compile time could violate the as-if rule, so compilers must not do it unless they can prove it's safe (e.g. for a store to a local variable whose address doesn't escape the function). That isn't the case here, so the store has to happen in the x86 asm before the call to join, even with mo_relaxed.

(Or in a hypothetical implementation, a join that could fully inline would at least have a compile-time memory barrier like GNU C asm("":::"memory") or maybe atomic_thread_fence() of some strength. Anything up to acq_rel needs no asm instructions on x86, just blocking compile-time reordering.)

x86 CPU cores share a coherent view of memory through a coherent cache. (MESI protocol or equivalent). As soon as the store commits from the store buffer to L1d cache, it becomes impossible for any other core to read a "stale" value. Inter-core latency is typically 40 to 100 nanoseconds on modern x86, IIRC (when both threads are already running on different physical cores).

See Is mov + mfence safe on NUMA? and When to use volatile with multi threading?, which explain more about how asm can't see stale values indefinitely on real CPUs (including non-x86 ones). So program order at compile time is sufficient here.

This same reasoning applies to every other real CPU that a sane C++ implementation can start std::thread threads across. (There are some heterogeneous system-on-chip CPUs with separate coherency domains between a microcontroller and a DSP, but a quality C++ implementation's std::thread wouldn't start threads across non-coherent cores.)

Or if an implementation did run on cores without coherent cache, its std::atomic would have to flush the cache line after a relaxed-atomic store. And maybe before a relaxed-atomic load. Or sync/write-back any dirty data across the whole cache before a release-store. So it would be hilariously inefficient to implement C++'s memory model on top of a non-coherent / explicit-coherency system. It would have to be coherent enough for atomic RMW to work as described (release-sequences and so on, and there only being one "live" copy of an atomic counter at any time).

This is why you wouldn't build a C++ implementation like that. You might allow starting work on other cores that are part of a different coherence domain (e.g. ARM cores outside the "inner-shareable" domain), but not via std::thread, because users would need to understand the implications of shared variables not working sanely between those cores.

Peter Cordes
  • Thanks for the detailed explanation. Am I understanding correctly that part of what you are saying is "the spec might technically allow that code to be broken, but no C++ implementation would ever implement it like that, so you shouldn't worry about it"? – Chronial Nov 07 '19 at 10:19
  • @Chronial: You definitely don't have to worry about it on any real implementation. Letting other threads' loads fail to see an atomic store *indefinitely* would be considered broken by everyone. It definitely violates the non-normative note in the standard about being visible "promptly"; even a cooperative multi-tasking single-core system (infrequent switches) would do "eventually". My "you wouldn't build an implementation like that" points are about an implementation that requires *expensive manual flushing* of the whole cache after every atomic store, to ensure visibility by other threads. – Peter Cordes Nov 07 '19 at 11:03
  • @Chronial: since you tagged this x86, my answer spent some time making the point that there's no way to correctly implement C++ *on x86* (or another ISA with coherent cache) where this property doesn't hold. (Unless it's actually compiling for a VM whose interpreter happens to run on x86, not really C++ function -> x86 asm function). Reordering the store at compile time with `thread.join()` would have to be considered broken because it's something that can establish a happens-before between the threads. – Peter Cordes Nov 07 '19 at 11:07