4

This program will sometimes print 00, but if I comment out `a.store` and `b.store` and uncomment `a.fetch_add` and `b.fetch_add`, which do the exact same thing (i.e. both end with a=1, b=1), I never get 00. (Tested on an x86-64 Intel i3, with g++ -O2.)

Am I missing something, or can "00" truly never occur according to the standard?

This is the version with plain stores, which can print 00.

// g++ -O2 -pthread axbx.cpp  ; while [ true ]; do ./a.out  | grep "00" ; done
#include<cstdio>
#include<thread>
#include<atomic>
using namespace std;
atomic<int> a,b;
int reta,retb;

void foo(){
        //a.fetch_add(1,memory_order_relaxed);
        a.store(1,memory_order_relaxed);
        retb=b.load(memory_order_relaxed);
}

void bar(){
        //b.fetch_add(1,memory_order_relaxed);
        b.store(1,memory_order_relaxed);
        reta=a.load(memory_order_relaxed);
}

int main(){
        thread t[2]{ thread(foo),thread(bar) };
        t[0].join(); t[1].join();
        printf("%d%d\n",reta,retb);
        return 0;
}

The version below, with fetch_add, never prints 00.

// g++ -O2 -pthread axbx.cpp  ; while [ true ]; do ./a.out  | grep "00" ; done
#include<cstdio>
#include<thread>
#include<atomic>
using namespace std;
atomic<int> a,b;
int reta,retb;

void foo(){
        a.fetch_add(1,memory_order_relaxed);
        //a.store(1,memory_order_relaxed);
        retb=b.load(memory_order_relaxed);
}

void bar(){
        b.fetch_add(1,memory_order_relaxed);
        //b.store(1,memory_order_relaxed);
        reta=a.load(memory_order_relaxed);
}

int main(){
        thread t[2]{ thread(foo),thread(bar) };
        t[0].join(); t[1].join();
        printf("%d%d\n",reta,retb);
        return 0;
}

See also: Multithreading atomics a b printing 00 for memory_order_relaxed

Peter Cordes
nvn
    Keep in mind that “I wasn’t able to make X happen in a certain amount of time with one particular test on one particular type of hardware” doesn’t mean that X cannot happen or that X is disallowed by the standard. – Jeremy Friesner Apr 05 '21 at 13:35
    See https://en.cppreference.com/w/cpp/atomic/memory_order#Explanation. – G. Sliepen Apr 05 '21 at 13:46
  • I wanted to know if there was something special about fetch_add compared to store. Probably on some hardware like arm64/ppc64 it can occur. – nvn Apr 05 '21 at 13:51
    RISC processors rarely have a native "atomic read/modify/write" instruction, so `fetch_add` would typically be implemented as a loop. – Raymond Chen Apr 05 '21 at 14:00
  • Ok nice point. I'm running on core i3 x86_64. – nvn Apr 05 '21 at 14:27
  • https://en.cppreference.com/w/cpp/atomic/memory_order#Relaxed_ordering explains precisely your problem. – n. m. could be an AI Apr 05 '21 at 14:58
  • Also you might find [this talk](https://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2) interesting. It's rather long but explains relaxed memory ordering very well. Notice how C++ memory ordering semantics do not really depend on any specific CPU architecture. – rustyx Apr 06 '21 at 15:03

3 Answers

6

The standard allows 00, but you'll never get it on x86 (without compile-time reordering). The only way to implement atomic RMW on x86 involves a lock prefix, which is a "full barrier", which is strong enough for seq_cst.

In C++ terms, atomic RMWs are effectively promoted to seq_cst when compiling for x86. (Only after possible compile-time ordering is nailed down - e.g. non-atomic loads / stores can reorder / combine across a relaxed fetch_add, and so can other relaxed operations, and one-way reordering with acquire or release operations. Although compilers are less likely to reorder atomic ops with each other since they don't combine them, and doing so is one of the main reasons for compile-time reordering.)

In fact, most compilers implement a.store(1, mo_seq_cst) by using xchg (which has an implicit lock prefix), because it's faster than mov + mfence on modern CPUs, and turning 0 into 1 with lock add as the only write to each object is exactly identical. Fun fact: with just store and load, your code will compile to the same asm as https://preshing.com/20120515/memory-reordering-caught-in-the-act/, so the discussion there applies.


ISO C++ allows the whole relaxed RMW to reorder with the relaxed load, but normal compilers won't do that at compile-time for no reason. (A DeathStation 9000 C++ implementation could/would). So you've finally found a case where it would be useful to test on a different ISA. The ways in which an atomic RMW (or even parts of it) can reorder at run-time depend a lot on the ISA.


An LL/SC machine that needs a retry loop to implement fetch_add (for example ARM, or AArch64 before ARMv8.1) may be able to truly implement a relaxed RMW that can reorder at run-time because anything stronger than relaxed would require barriers. (Or acquire / release versions of the instructions like AArch64 ldaxr / stlxr vs. ldxr/stxr). So if there's an asm difference between relaxed and acq and/or rel (and sometimes seq_cst is also different), it's likely that difference is necessary and preventing some run-time reordering.

Even a single-instruction atomic operation might be able to be truly relaxed on AArch64; I haven't investigated. Most weakly-ordered ISAs have traditionally used LL/SC atomics, so I might just be conflating those.

In an LL/SC machine, the store side of an LL/SC RMW can even reorder with later loads separately from the load, unless they're both seq_cst. For purposes of ordering, is atomic read-modify-write one operation or two?


To actually see 00, both loads would have to happen before the store part of the RMW was visible in the other thread. And yes, the HW reordering mechanism in an LL/SC machine would I think be pretty similar to reordering a plain store.

Peter Cordes
    Didn't the latest version of powerpc introduce atomic fetch_add instructions? (I thought I saw them, but at least gcc doesn't use them yet, so not sure) Not important, but since I noticed the "before ..." for aarch64... – Marc Glisse Apr 05 '21 at 20:48
    @MarcGlisse: Oh IDK, it would make sense for other ISAs that want to scale to huge numbers of cores to introduce single-instruction RMWs as an alternative to LL/SC retry loops, because it certainly helped for AArch64 CPUs like Graviton 2. (https://www.anandtech.com/show/15578/cloud-clash-amazon-graviton2-arm-against-intel-and-amd/2). If anyone else was going to, POWER wouldn't surprise me at all. A long list wasn't necessary so I just dropped mention of PowerPC to avoid distraction. Without any `-march=` options, you do get LL/SC from PowerPC GCC, in case anyone's wondering. – Peter Cordes Apr 05 '21 at 22:05
  • Say now reordering could happen but we want to prevent it across all architectures. i tried inserting atomic_thread_fence(memory_order_seq_cst) in both threads between store and load. It prevents re-ordering but memory_order_acquire_release doesn't. Is memory_order_seq_cst the weakest fence you can put in ? – nvn Apr 06 '21 at 03:49
    @nvn: yes, I think either both operations themselves need to be `seq_cst`, or you need a `seq_cst` fence between them (as discussed [in comments on @Christophe's answer](https://stackoverflow.com/posts/comments/118360633)). Remember, the asm needs to block StoreLoad reordering, and only full barriers do that, not just release and/or acquire (e.g. PowerPC `lwsync`). https://preshing.com/20130922/acquire-and-release-fences/. In C++, a full barrier is how compilers implement `atomic_thread_fence(mo_seq_cst)`, but on x86 all other stdatomic fences are a no-op. – Peter Cordes Apr 06 '21 at 03:56
    This answer is misleading. "*you'll never get it on x86*" - that depends on the compiler actually. A compiler is allowed to reorder `a.load(memory_order_relaxed)` before `b.store(1,memory_order_relaxed)` (and the same for `b.load` vs. `a.store`) if for whatever reason it decides that that's faster (which can happen once the codebase gets larger). The output can then become 00. – rustyx Apr 06 '21 at 09:24
  • @rustyx: I did say that later in the answer, but the answer has gotten long enough that it's a good idea to repeat that caveat right at the top, as part of the summary opening line. – Peter Cordes Apr 06 '21 at 14:54
2

The key in this question is to realize that relaxed memory ordering gives you no guarantee of synchronization between the threads:

Atomic operations tagged memory_order_relaxed are not synchronization operations; they do not impose an order among concurrent memory accesses. They only guarantee atomicity and modification order consistency.

Thus in the first code, different scenarios could happen. For example:

  • The thread running foo() executes completely before the one running bar(): retb is 0, reta is 1, so you get 10.
  • The thread running bar() executes completely before the one running foo(): reta is 0, retb is 1, so you get 01.
  • The threads running foo() and bar() execute their instructions interleaved at the same time: reta and retb are both 1, and you get 11.
  • Relaxed memory ordering also allows unsynchronised situations, where each thread updates its own atomic and sees its own store, but still sees the stale value of the other thread's atomic (i.e. the value from before the change). So reta and retb can both be 0, giving you 00.

The second code suffers from the same problem: it still uses relaxed ordering, and the accesses used to set reta and retb are read-only accesses to atomics modified in the other thread. All four outcomes remain possible.

If you want to ensure that synchronisation happens as expected, you need a global order between all the atomic operations, and thus must go with memory_order_seq_cst. This would rule out 00 but still leave all the other combinations possible.

(Note: my former suggestion to use memory_order_acquire is indeed not sufficient, since it still guarantees no order across the threads for operations on different atomics, as explained by Peter in the comments)

Christophe
    Are you suggesting that `reta = a.exchange(1, mo_relaxed)` would rule out `00`? I don't think that's correct; within each thread, it would still be possible for `a.exchange` to reorder with `b.fetch_add` and have both exchanges happen before either fetch_add. – Peter Cordes Apr 05 '21 at 20:17
    Also, I think the point of the question was why `00` isn't observed in practice with the OP's fetch_add code, on the OP's machine. Part of that was inferred by context of the OP's previous questions; it wasn't clearly stated in the question itself until I edited. You correctly explain that ISO C++ allows it, but not why it doesn't happen on x86. – Peter Cordes Apr 05 '21 at 20:19
    Yeah, after reading the linked answer from Anthony, that claim about "latest value" from RMW operations is a bit more clear. It's a possibly misleading way to phrase things, and it's only talking about the modification order of a single atomic variable. Here there is one modification each of two separate variables. (Or with read replaced with RMW, one of each var in each thread). So you're misinterpreting Anthony's phrasing. To prevent 00 with pure ISO C++ rules, I think you'd need `atomic_thread_fence(mo_seq_cst)` before the load in each thread, after the store or fetch_add. – Peter Cordes Apr 05 '21 at 20:26
    @PeterCordes No, I don't suggest that it rules out 00; on the contrary: in my four bullets I give example scenarios that could explain any of the four possible outcomes, including 00 (see the last bullet). I also explain that 00 can be ruled out with relaxed mo only if the code were changed to use a read-modify-write instead of a load (see details in Anthony's linked answer). – Christophe Apr 05 '21 at 21:03
    I was talking about exchange replacing just the load, note the `reta = ...`. So it seems you *are* suggesting that would rule out `00`. The original code uses `b.store(1,memory_order_relaxed);` / `reta=a.load(memory_order_relaxed);`. If you replace `reta=a.load` with `reta=a.exchange`, that's still unordered wrt. `b.store` because they're both relaxed and on *different* objects. As I said, you're misinterpreting Anthony's answer (because it's phrased misleadingly.) – Peter Cordes Apr 05 '21 at 21:08
    @PeterCordes I also find it annoying that an architecture neutral question, suddenly is changed to become an x86 specific one, even if your x86 details are very interesting: OP must understand that when working with c++ concurrency, even if one specific situation is not observed, it’s not sufficient to rule it out (which was the purpose of my answer) – Christophe Apr 05 '21 at 21:11
    Yeah, it's not a great question in that respect, and/or has two aspects. This would be a good answer to the pure ISO C++ part, if not for the incorrect claim about a relaxed `a.exchange` only being able to happen after a `b.store` is globally visible, or whatever other mechanism you think would give that guarantee. That's of course true on x86 (because RMWs have to get promoted to full barriers), but nothing in the ISO C++ standard guarantees anything like that. I'd like to upvote this answer once you fix that problem. – Peter Cordes Apr 05 '21 at 21:15
  • @PeterCordes ok. I had the incrementing in mind. But I think I got it: Anthony's answer made me over-optimistic. So even when incrementing the value, there would be no ordering guarantee, since it's on two different atomics. I updated the last paragraph of my answer for a more traditional release/acquire approach. Let me know if it's ok. – Christophe Apr 05 '21 at 21:43
  • Nope, you need both store and load to be `seq_cst`, otherwise StoreLoad reordering is allowed. With just acquire, it could compile to plain store / plain-load, producing the x86 asm from https://preshing.com/20120515/memory-reordering-caught-in-the-act/. Always remember that an acquire-load is only a 1-way barrier (https://preshing.com/20120913/acquire-and-release-semantics/). – Peter Cordes Apr 05 '21 at 21:45
  • Even an acquire barrier (`thread_fence(mo_acquire)`) wouldn't be sufficient because it wouldn't order an earlier store wrt. a later load. Only a "full barrier" (mo_seq_cst) does that. https://preshing.com/20130922/acquire-and-release-fences/. – Peter Cordes Apr 05 '21 at 21:49
  • Note that all this reasoning is done after mapping ISO C++ rules to CPU-architecture "reordering". ISO C++ defines ordering guarantees in terms of synchronizes-with, and existence of orders that are compatible with sequenced-before (for seq_cst) and so on. I think I understand both well enough to say that this is valid. Certainly it's sufficient for cases where you just need a counterexample, e.g. this C++ can compile like so on some machine, and that machine allows this. Unless you've discovered a bug in C++->asm mappings, that's a valid way to show that C++ must allow something. – Peter Cordes Apr 05 '21 at 21:52
    @PeterCordes Ok. Now with your articles and after re-reading `[intro.multithreading]` and `[atomics.order]` I see your point. – Christophe Apr 06 '21 at 10:25
-1

I get "10" in both cases. The first thread will always run faster, so a == 1 by the time bar() loads it. But if you add additional operations to foo():

#include<cstdio>
#include<thread>
#include<atomic>
using namespace std;
atomic<int> a,b;
int reta,retb;

void foo(){

    volatile int i=0;   // volatile, so the delay loop isn't optimized away at -O2
    while(i < 10000000)
        i++;

    a.fetch_add(1,memory_order_relaxed);
    //a.store(1,memory_order_relaxed);
    retb=b.load(memory_order_relaxed);
}

void bar(){
    b.fetch_add(1,memory_order_relaxed);
    //b.store(1,memory_order_relaxed);
    reta=a.load(memory_order_relaxed);
}

int main(){
    thread t[2]{ thread(foo),thread(bar) };
    t[0].join(); t[1].join();
    printf("%d%d\n",reta,retb);
    return 0;
}

You will receive "01"!

  • Is it possible to print 00? If you comment out the fetch_add and uncomment the store, I get 00. But fetch_add never gives me 00. – nvn Apr 05 '21 at 14:26
  • *The first thread will always run* - not necessarily. If you run it enough times, you should see 01 eventually, even with the original code without a delay loop. Anyway, that's not the question. – Peter Cordes Apr 05 '21 at 20:14