is this memory order correct?

Question

std::atomic<bool> x{ false };
std::atomic<bool> y{ false };

// thread 1
y.store(true, std::memory_order_seq_cst);
x.store(true, std::memory_order_release);

// thread2
while (!x.load(std::memory_order_relaxed);
assert(y.load(std::memory_order_seq_cst)); // !!!

Can the assert fail? My understanding is: while reading x is "relaxed", once "thread 2" sees the write by "thread 1" it can't see y as false because the write to y happens before the write to x.

The memory order is replicated from a real-life case and could be weaker for this sample, but I haven't changed it to not miss any subtleties.

Should it be `while (!x.load(std::memory_order_relaxed));` instead? — Quimby, Mar 20 '23 at 13:38
Typo in `assert(y.load(std::memory_order_seq_cst);` missing a bracket. Please update it to `assert(y.load(std::memory_order_seq_cst));` — SLNM, Mar 20 '23 at 13:50

score 7 · Accepted Answer · answered Mar 20 '23 at 14:02

7

Yes, ISO C++ allows the y.load to take a value from before x.load saw a true, because x.load isn't acquire or stronger.

(More formally, it doesn't create a happens-before before the writer and reader that put y.store before y.load)

On many ISAs (such as x86 or PowerPC) it wouldn't be possible to observe it in practice, x86 because LoadLoad reordering isn't allowed in general, and PowerPC because a seq_cst load involves some fencing before the instruction (perhaps to block IRIW reordering), which ends up being sufficient to block LoadLoad reordering even with earlier relaxed loads. https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html shows PowerPC's load(seq_cst) being hwsync; ld; cmp; bc; isync. hwsync is a full barrier, the same instruction they use for atomic_thread_fence(seq_cst).

But you might be able to observe this on AArch64 where LDR (load(relaxed) can reorder with a later LDAR (load(seq_cst)). To actually see it, you probably need the variables in separate cache lines, and for the loop exit branch to be correctly predicted, otherwise the later SC load won't get into the pipeline until after the spin-loop load has produced a value (and branch mispredict recovery has re-steered the front-end to fetching from the correct path).

answered Mar 20 '23 at 14:02

Peter Cordes

328,167
45
605
847

thanks Peter, arm64 is where I'm seeing this, and yes, the variables are on different cache lines. the follow-up question, if you don't mind. in the real-life case, instead of `assert`, the memory occupied by both variables is freed. is this a problem? is this only about "visibility" by "thread 2" or can "thread 1" actually cause "use after free"? – Andriy Tylychko Mar 20 '23 at 14:13
1

@AndriyTylychko: So the reader spins `relaxed` on `x`, and then frees the memory holding both `x` and `y`? I don't think it would be able to hand the memory back to the OS and make it actually fault, since CPUs don't do out-of-order exec across system call instructions. So even a relaxed load has to be actually seen, not just speculated, before we can get into the kernel and do anything to the page tables (or in fact before any later store can be non-speculative after a load). And we can't see `x` until both it and `y.store` have actually run. – Peter Cordes Mar 20 '23 at 14:23
1

@AndriyTylychko: As far as just putting the memory on a free-list and allocating it to something else, where its contents might get stepped on by `y.store`, I think that *is* perhaps possible if the thread doing the freeing is also the thread that reuses the allocation. Free and alloc will involve some stores, but store-forwarding within this core can allow that to happen in the shadow of branch speculation on the loop condition. – Peter Cordes Mar 20 '23 at 14:27
yeah, I was wondering about the distinction between "visibility" and the actual order of "thread 1". not sure I understand your last comment though, sorry. the case looks like this: https://godbolt.org/z/EoarEnKzq. how can "thread 1" reorder `y` and `x` stores, isn't the order enforced by the SC write to `y`? – Andriy Tylychko Mar 20 '23 at 14:36
@AndriyTylychko: Thread 1 can't reorder the visibility order of those stores (the order they commit to L1d cache), the second one is release or stronger. What can happen is thread 2 predicting a branch and doing a bunch of stuff, and only later seeing the `x` value that confirms this is the correct path of execution. The two stores will become visible to this thread in `y` then `x` order, but the problem is that we were able to execute a bunch of stuff after the spin-loop before we actually saw `x`. (Plausible only if branch prediction predicted that the spin-loop would run zero retries.) – Peter Cordes Mar 20 '23 at 14:50
@AndriyTylychko: In your Godbolt link, you now have thread 2 spinning on a seq_cst load. That's stronger than you need; `acquire` is sufficient; after seeing that, you can be sure that any later memory operations in this thread are ordered after both `y.store` and `x.store`. – Peter Cordes Mar 20 '23 at 14:52
that was my mistake, it should have been “relaxed”. thanks for your time – Andriy Tylychko Mar 20 '23 at 15:03

score 4 · Answer 2 · answered Mar 20 '23 at 15:07

Peter's answer is right. I just wanted to address a possible misconception behind the question.

You have to avoid reading too much into the name of the "happens before" relation. Formally, HB is just an arbitrary name given to a relation that may or may not hold between any given operation and another, following certain rules and having certain consequences. It doesn't necessarily correspond to events physically happening inside the machine.

Treat it like a mathematical definition, because that's what it is. You can only reason based on its precise formal definition, and not based on the everyday meaning of the English words "happens" and "before".

It sounds like you're interpreting "A happens before B" as meaning "A will be observed before B in all scenarios", but its implications are much more limited than that. It is true that y.store() happens before x.store(), and you can prove it from the definitions given in the C++ standard, simply because they are sequenced in that order within a particular thread (in other words, "happens before" is always consistent with program order). But it does not follow that the assert can't fail.

To get from facts about the HB relation to deductions about the actual behavior of the program, you have to make use of the coherence rules in the standard [intro.races p13-18], and these only let you draw conclusions about two relationships on the same object. They can be summarized as: if A and B are operations on the same object, and A happens before B, then B will observe the result of A. In this case, the behavior of A and B with respect to each other is "what you think it should be", as if they were executing in program order on a strictly in-order machine from the 1980s.

So in order to show that the assert can't fail, what you would need to prove is not that y.store() happens before x.store(); rather you would need to prove that y.store() happens before y.load(). And that, you will not be able to do.

Roughly speaking, the only way to link an HB relation between two threads is via synchronization, which usually results from an acquire load observing the result of a release store. You do know that the last x.load() in the loop must observe the result of the x.store(), but since the load is not acquire it does not help us.
The operations on y are both seq_cst and so in particular the store and load are respectively release and acquire, but in order to draw any conclusions from that we would have to know that y.load() observes the result of y.store(), which is exactly what we don't know and are trying to find out.

The bit about synchronization gives a hint about the fix: the load of x should be upgraded to acquire or stronger. Alternatively, you can leave it as relaxed and place an acquire fence between the loop and the y.load(), which roughly has the effect of retroactively upgrading the final x.load() to acquire.

is this memory order correct?

2 Answers2