
Consider the following program:

#include <thread>
#include <atomic>
#include <cassert>

int x = 0;
std::atomic<int> y = {0};
std::atomic<bool> x_was_zero = {false};
std::atomic<bool> y_was_zero = {false};

void write_x_load_y()
{
    x = 1;
    if (y == 0)
        y_was_zero = true;
}

void write_y_load_x()
{
    y = 1;
    if (x == 0)
        x_was_zero = true;
}

int main()
{
    std::thread a(write_x_load_y);
    std::thread b(write_y_load_x);
    a.join();
    b.join();
    assert(!x_was_zero || !y_was_zero);
}
  1. Given the constraint that everything can be atomic except access to x, how can I guarantee that the assert passes?
  2. If that's not possible as-is, is it possible if access to x can be atomic but no stronger than "relaxed"?
  3. What is the least amount of synchronization (e.g. weakest memory models for all operations) necessary to guarantee this?

It's my understanding that without any form of fences or atomic access, it's possible (if only theoretically so) for the store x = 1 to sink below the load y == 0 (having been moved by the CPU if not per se by the compiler), causing a potential race where both x and y are 0 (and triggering that assertion).

I was initially under the naïve impression that SEQ_CST guarantees total ordering of non-atomic variables. That is, a non-atomic (or relaxed) store of x sequenced before a SEQ_CST load of y is guaranteed to actually happen first; similarly, a SEQ_CST store of y sequenced before a non-atomic (or relaxed) load of x is guaranteed to actually happen first; put together, that would prevent the race. However, on further reading of https://en.cppreference.com/w/cpp/atomic/memory_order, I don't think the documentation actually says this, but rather that such ordering is only guaranteed for the opposite case (loads before stores), or for cases where access to both x and y is SEQ_CST.

Similarly, I naïvely had thought that a memory barrier would force all loads OR stores before the barrier to happen before all loads OR stores after it, but reading https://en.cppreference.com/w/cpp/atomic/atomic_thread_fence seems to imply that it's again only true for forcing ordering of loads before the barrier with stores after it. That doesn't help here either, I think, unless I'm supposed to put barriers in a less obvious place than "between the store and the load".

What synchronization method should I use here? Is it even possible?

ObsequiousNewt

1 Answer


This idea is fatally flawed, and impossible to make safe in ISO C++ with non-atomic x. Data-race Undefined Behaviour (UB) is unavoidable because one thread writes x unconditionally and the other reads it unconditionally.

At best you'd be rolling your own atomics by using compiler barriers to force one thread to sync actual memory state with abstract-machine state. But even then, rolling your own atomics without volatile is not very safe: https://lwn.net/Articles/793253/ explains why the Linux kernel's hand-rolled atomics use volatile casts for pure-store and pure-load. This gives you something like relaxed-atomic on normal compilers, but of course zero guarantee from ISO C++.

When to use volatile with multi threading? Basically never; you can get the same efficient asm from using atomic<int> with mo_relaxed. (Or on x86, even acquire and release are free in asm.)

If you were going to attempt this, then in practice on most implementations, std::atomic_thread_fence(std::memory_order_seq_cst) will block compile-time reordering of non-atomic operations across it. (e.g. in GCC I think it's basically equivalent to x86 asm("mfence" ::: "memory")1, which blocks compile-time reordering and is also a full barrier.) But I think some of that "strength" is an implementation detail, not something required by ISO C++.

Footnote 1: BTW, usually you want a dummy lock add with stack memory, not actual mfence, because mfence is slower.


Semi-related: your bool variables don't need to be atomic. IDK if it's more or less distracting to make them atomic; I was leaning towards it being simpler if they're not. They're each written by at most one thread, and only read after that thread has been joined. You could make them plain bool, and also write them unconditionally, like y_was_zero = (y == 0);, if you want. (That's about neutral as far as simplicity goes, though it saves looking at their initializers.)


  3. What is the least amount of synchronization (e.g. weakest memory models for all operations) necessary to guarantee this?

x needs to be atomic<> and both stores need to be seq_cst. (This is basically equivalent to draining the store buffer after doing the store).

Like in https://preshing.com/20120515/memory-reordering-caught-in-the-act/

In practice I think both loads can be relaxed on most machines (though maybe not on POWER, where store-forwarding between SMT threads on the same core is possible). For ISO C++ to guarantee it, I think you need seq_cst on both loads as well, so that all 4 operations are part of a single global total order of operations across multiple objects that's compatible with program order. There's no synchronizes-with via release/acquire here to create a happens-before relationship.

Generally seq_cst is the only ordering in the ISO C++ memory model that must translate to blocking StoreLoad reordering, in a hardware-style memory model based on the existence of an actual coherent state that exists even if nobody's looking at it, with individual threads accessing that state subject to local reordering. (ISO C++ only talks about what other threads can observe, and hypothetical observers in theory might not constrain code-gen. But in practice they do, because compilers don't do whole-program inter-thread analysis.)


If you for some reason can't make x be atomic<>

Use C++20 atomic_ref<> to construct a reference to x that you can use to do xref.store(1, mo_seq_cst) or xref.load(mo_seq_cst).

Or with GNU C/C++ atomic builtins, __atomic_store_n(&x, 1, __ATOMIC_SEQ_CST) (which is exactly what C++20 atomic_ref is designed to wrap.)

Or with semi-portable stuff, *(volatile int*)&x = 1; and a barrier, which might or might not work, depending on the compiler. A DeathStation 9000 can certainly make volatile int assignment non-atomic if it wants to. But fortunately the compilers people choose to use in real life aim to not be terrible, and often to be usable for low-level systems programming. Still, this is not at all guaranteed by anything to work.

Peter Cordes
  • That is unfortunate; I guess the C++ standard really is just lacking in that respect. I guess the practical answer is `__atomic_thread_fence( __ATOMIC_SEQ_CST )` then. It is tricky because I do care about more than just x86, if not per se any machine in existence, and the idea of using an API that isn't actually guaranteed to work scares me. (I figured the atomic bools were probably redundant, but left them in because I didn't want to think too hard about it.) – ObsequiousNewt Jan 10 '21 at 05:45
  • @ObsequiousNewt: what are you hoping to gain from not making `x` atomic? Maybe you're looking for C++20 `atomic_ref` to do safe seq_cst access to it here, but still be able to access it non-atomically in other parts of your code? Yes, that was a weakness in C++ before C++20. (Even C++20 still doesn't make efficient seq-locks possible. There's no UB-free way to do loads that *maybe* have tearing, and then only use the memcpy result after you check for safety. In practice you get safe portable asm if you do it right, but it's still annoying.) – Peter Cordes Jan 10 '21 at 05:48
  • In essence, I don't have control over `x`. (This problem is a minimally simplified version of a futex-like operation, where `x` is the futex itself, and `y` basically represents a thread being queued into a lock-free array. As such I can actually make the read of `x` atomic, but not the write.) – ObsequiousNewt Jan 10 '21 at 05:54
  • @ObsequiousNewt: Ok, then use C++20 atomic_ref<> to construct a reference to `x` that you can use to do `xref.store(1, mo_seq_cst)` or `xref.load(mo_seq_cst)`. Or with GNU C/C++, `__atomic_store_n(&x, 1, __ATOMIC_SEQ_CST)` (which is exactly what C++20 atomic_ref is designed to wrap.) Or with semi-portable stuff, `*(volatile int*)&x = 1;` and a barrier. – Peter Cordes Jan 10 '21 at 05:58
  • Right, the problem is I can't do that, I'm not the one writing to `x`. I can give more context if it helps, but suffice it to say that I can't guarantee that a write to `x` is atomic in any sense. – ObsequiousNewt Jan 10 '21 at 06:01
  • @ObsequiousNewt: If the store to `x` is in separately-compiled code, a non-inline function call does effectively create a compiler barrier. An `atomic_thread_fence(seq_cst)` should then be sufficient. (Unless link-time optimization can do cross-file inlining to allow compile-time reordering.) This is somewhat safe; in practice if `x` is "naturally atomic" on the target machines you care about (a `long` or smaller), you shouldn't get tearing. And if you don't care about later reloads of `x` or other things before control returns to your code, you can barrier early enough. – Peter Cordes Jan 10 '21 at 06:06
  • @ObsequiousNewt: But be sure to read https://lwn.net/Articles/793253/ to get some ideas about all the ways `x = 1;` can compile, e.g. to unconditionally storing `0` then maybe storing `1` if the compiler was so inclined. Although that's more of a theoretical possibility that's unlikely in practice. As long as you control reading plain `x`, you can avoid the more likely problems, like the compiler turning one read into multiple reads instead of loading once into a temporary and spilling/reloading it when necessary. – Peter Cordes Jan 10 '21 at 06:08
  • @ObsequiousNewt: So anyway, your actual SO question isn't very helpful to the hack you want to create: the answer to your SO question is obviously to make `x` atomic, and the details of what you can somewhat-safely do in practice depend strongly on the other details of your problem. (This all sounds like a bad idea that's unlikely to pass a smell test well enough to make it worth whatever you're hoping to gain here. Probably better to make a copy of the library and modify it if you really want to do this safely.) – Peter Cordes Jan 10 '21 at 06:11
  • Yeah, I'm basically already safe as far as compiler reordering goes; I'm mostly concerned about CPU reordering. I think as long as a seq_cst fence guarantees that in practice (if not strictly according to the spec) I'll just use that. (I can provide more details as to why I really do need this behaviour even though it's a bad idea, but I suspect an extended comment chain is not the place to do that.) – ObsequiousNewt Jan 10 '21 at 06:15
  • @ObsequiousNewt: Yeah, compilers don't optimize atomics (at all for now), and a seq_cst fence does reliably compile to include a StoreLoad barrier. And will definitely block compile-time reordering with later atomic accesses, and typically also non-atomic accesses to non-private memory. (At least as an implementation detail.) You can kind of think of `atomic_thread_fence(seq_cst)` as equivalent to an asm full barrier, but you do have to be careful with reasoning that way. – Peter Cordes Jan 10 '21 at 06:17