
I have been exploring the world of C++ for a long time, and I am interested in the following question. I'm looking for a formal answer, with links to the parts of the C++ standard that confirm it. I hope you will find this non-trivial question interesting :)

So A is a global std::atomic<int> variable. Other threads only read it.

void foo() {
  // here A != 42
  A.store(42, std::memory_order::relaxed);
  auto a = A.load(std::memory_order::relaxed);
  if (a != 42) assert(false && "What!?");
}

Some thread calls foo(). What language rules guarantee that A.store(42, ...) happens before (more precisely, is sequenced before) A.load(...) within the thread that called foo()?

Now let's add a second variable, B, to the problem. It is a global std::atomic<int> variable too.

And now the foo() function looks like this:

void foo() {
  // here A != 42
  A.store(42, std::memory_order::relaxed);
  B.store(0, std::memory_order::seq_cst);
  auto a = A.load(std::memory_order::relaxed);
  if (a != 42) assert(false && "What!?");
}

On x64, I can be sure that such code is equivalent to (except that we don't change B :D):

void foo() {
  // here A != 42
  A.store(42, std::memory_order::relaxed);
  std::atomic_thread_fence(std::memory_order::seq_cst);
  auto a = A.load(std::memory_order::relaxed);
  if (a != 42) assert(false && "What!?");
}

But is there such a guarantee from the point of view of the C++ language standard?

user3840170
  • These are not equivalent. `atomic_thread_fence` imposes more requirements than operations on single atomic variables do. Can't tell if it manifests on x64 in any way. – ALX23z May 19 '23 at 09:43
  • @ALX23z Do you mean that a dummy `B.store(0, std::memory_order::seq_cst);` between the `relaxed` operations on x64 will not add a guarantee that `A.store` happens before `A.load`? As far as I know, this results in flushing of all buffers and synchronization at the shared cache level. – Alexander Spichak May 19 '23 at 09:55
  • A single thread always sees its own operations in program order; that's what "sequenced before" is all about. If later loads might not see earlier stores in asm, the compiler would need to emit fences before most loads or after most stores; it'd be madness. That's why CPUs maintain the illusion of code executing in program order. (And why C++ won't reorder operations on the same object.) – Peter Cordes May 19 '23 at 09:55
  • @PeterCordes Yes, you probably answered my question. Thank you :) – Alexander Spichak May 19 '23 at 09:57
  • @AlexanderSpichak `atomic_thread_fence` interacts with other atomic variables. Say you have an atomic variable `C` that triggered a `release` instruction in some other thread: the fence will impose synchronization with it, while the operation on atomic `B` will not synchronize with `C` (unless `C` triggered a seq_cst memory instruction). – ALX23z May 19 '23 at 10:01
  • [Can relaxed memory model reorder on same thread?](https://stackoverflow.com/q/70442119) explains that this thread will see its own store. This is possible regardless of the order which other threads see the store; the thread that did the store can do store-forwarding from its own [store buffer](https://stackoverflow.com/questions/64141366/can-a-speculatively-executed-cpu-branch-contain-opcodes-that-access-ram) to see the store before it becomes globally visible, so no, this ordering does not imply that there's an `atomic_thread_fence(seq_cst)`!! – Peter Cordes May 19 '23 at 10:03
  • Also related: [Globally Invisible load instructions](https://stackoverflow.com/q/50609934) and [With memory\_order\_relaxed how is total order of modification of an atomic variable assured on typical architectures?](https://stackoverflow.com/q/58827774) / [Understanding the memory ordering for C++ atomics in a single thread](https://stackoverflow.com/q/64476918) – Peter Cordes May 19 '23 at 10:04
  • @ALX23z It looks like I was wrong. Thanks for the detailed answer! – Alexander Spichak May 19 '23 at 10:05

1 Answer


What language rules guarantee that A.store(42, ...) happens before (more precisely, is sequenced before) A.load(...) within the thread that called foo()?

This one actually is simple :) See [intro.execution] p9: "Every value computation and side effect associated with a full-expression is sequenced before every value computation and side effect associated with the next full-expression to be evaluated."

  • A.store(42, std::memory_order::relaxed) is a full-expression under [intro.execution] p5.6 ("an expression that is not a subexpression of another expression and that is not otherwise part of a full-expression.").

  • auto a = A.load(std::memory_order::relaxed); is an init-declarator and thus a full-expression under p5.4.

Sequenced-before is program order, plain and simple. This is not the subtle part of memory ordering. Your A.store absolutely happens-before your A.load, and your assert can never ever fail.

The C++ memory model doesn't change the semantics of single-threaded programs - they are still the "natural" semantics - and it doesn't change the semantics of accesses within a single thread. Otherwise it'd be nearly impossible to program at all.

As another way to look at it: if you had written

int b;
b = 42;
int a = b;
assert(a == 42);

you would not even be asking the question, right? The semantics of atomic variables are strictly stronger than those of non-atomic variables, even with relaxed ordering. So anything that works (i.e. is well-defined) with non-atomic variables will still work if they are upgraded to atomic, no matter what memory_order you use.

The place where confusion sometimes arises is in what happens-before really means. Some people, when they learn about it, think that it makes memory ordering trivial, because they assume that "X happens before Y" tells you that "X will always be observed before Y". It doesn't mean that. What it tells you is just that "Y will observe X".


As to your second question, no, in general an unrelated seq_cst access does not imply a seq_cst fence.

Not even on x86, for that matter. On x86, StoreStore, LoadLoad and LoadStore reordering are already impossible, so seq_cst just has to prevent StoreLoad reordering. This can be accomplished simply by ensuring that between every seq_cst store and every seq_cst load, there is at least one barrier instruction (e.g. mfence, though on x86 an unrelated locked RMW instruction acts as a barrier too). A compiler can do this in two ways:

  1. A seq_cst load emits just an ordinary load; a seq_cst store emits an ordinary store followed by a barrier.

  2. A seq_cst load emits a barrier followed by an ordinary load; a seq_cst store emits just an ordinary store.

So with a compiler following strategy #2, your store to B would just be an ordinary store instruction, with no barrier in sight. It has its usual release semantics, but could still be reordered with later instructions, such as your A.load().

Now, in your program that doesn't have much significance. And for that matter, neither would the fence. Putting a fence or other barrier between accesses to the same variable doesn't really achieve anything. If A were the only variable in your program shared between threads, then adding or removing the fence (or the B.store()) would not change the program's possible behaviors in any way. Fences are only useful when you place them between accesses to different variables. If there were other accesses in your program not shown, then putting a fence on that line could make a difference, but we would have to see the rest of the program to be able to say more.

For example, suppose you had

A.store(42, std::memory_order_relaxed);
B.store(0, std::memory_order_seq_cst);
C.store(17, std::memory_order_relaxed);

Since seq_cst implies release, the A and B stores cannot be reordered with each other. However, C can be reordered before B, and then before A. So another thread doing c = C.load(acquire); a = A.load(acquire); can get c == 17 && a == 0. This is even possible on x86: since the C++ memory model allows the behavior, the compiler is allowed to do the reordering and emit mov [c], 17; mov [a], 42; mov [b], 0. But if in place of the B.store(), you put a release or seq_cst fence, then c == 17 && a == 0 is no longer possible on any conforming implementation.

Nate Eldredge
  • Nice answer, I like your point that `relaxed` isn't ever weaker than non-atomic; good way to explain it. BTW, the actual implementation strategy current compilers use for x86 seq_cst stores isn't a separate barrier, it's to do the store itself with `xchg` instead of `mov`, ignoring the xchg "return value". That's equivalent but faster than `mov` + a dummy `lock add byte [rsp], 0` or `mov` + `mfence`. See [Why does a std::atomic store with sequential consistency use XCHG?](https://stackoverflow.com/q/49107683) – Peter Cordes May 20 '23 at 21:30