
This time I use atomic_fetch_add. Here is how I think I can get ra1=1 and ra2=1: both threads execute a.fetch_add(1,memory_order_relaxed) while a=0, each write goes into its core's store buffer and isn't visible to the other core, so each load sees only its own increment, giving ra1=1 and ra2=1.

I can reason about how it prints 12, 21 and 22:

  • 22: both foo and bar increment a before either load, so a=2 is visible to both a.load calls.
  • 12: thread foo completes, and thread bar starts only after foo's store.
  • 21: bar runs first, then foo.
// g++ -O2 -pthread axbx.cpp ; while [ true ]; do ./a.out | grep "11"; done doesn't print 11 within 5 mins
#include <atomic>
#include <thread>
#include <cstdio>
using namespace std;
atomic<long> a,b;
long ra1,ra2;
void foo(){
        a.fetch_add(1,memory_order_relaxed);
        ra1=a.load(memory_order_relaxed);
}
void bar(){
        a.fetch_add(1,memory_order_relaxed);
        ra2=a.load(memory_order_relaxed);
}
int main(){
  thread t[2]{ thread(foo),thread(bar)};
  t[0].join();t[1].join();
  printf("%ld%ld\n",ra1,ra2); // This doesn't print 11 but it should
}
nvn

1 Answer


a.fetch_add is atomic; that's the whole point. There's no way for two separate fetch_adds to step on each other and only result in a single increment.

Implementations that let the store buffer break that would not be correct implementations, because ISO C++ requires the entire RMW to be one atomic operation, not an atomic load followed by a separate atomic store.

(e.g. on x86, lock add [a], 1 is a full barrier because of how it has to be implemented: the core makes sure the updated data is visible in L1d cache as part of executing the instruction. See Can num++ be atomic for 'int num'?

On some other implementations, e.g. AArch64 before ARMv8.1, it will compile to an LL/SC retry loop¹, where the Store-Conditional will fail if this core lost exclusive ownership of the cache line between the load and store.)

Footnote 1: Actually, current GCC will call the libatomic helper function if you omit -march=armv8.1-a or -mcpu=cortex-a76 or whatever, so it can still benefit via runtime CPU dispatching from using the new single-instruction atomics like ldadd w2, w0, [x0] instead of a retry loop, in the likely case of the code running on an ARMv8.1 CPU. https://godbolt.org/z/vhePM9h8a

Peter Cordes
  • long reta = a.fetch_add(1,memory_order_relaxed); Is the assignment to an automatic-storage local variable also atomic? – nvn Apr 05 '21 at 07:36
  • 1
    @nvn: No, of course not; `reta` has type `long`, not an atomic type. Even if it was `atomic`, it would be a *separate* atomic store of the return-value temporary from the `std::atomic::fetch_add(T, mem_order)` member function. – Peter Cordes Apr 05 '21 at 07:44
  • I have another follow up question for you https://stackoverflow.com/questions/66953833/strange-behaviour-using-c-atomics After looking at the answer i will come back to this. – nvn Apr 05 '21 at 15:03
  • The processor can forward a memory write to a subsequent read from the same address under certain conditions. Store forwarding works if a write to memory is followed by a read from the same address when the read has the same operand size, which means if it weren't fetch_add but a plain store, it could have been reordered? https://easyperf.net/blog/2018/03/09/Store-forwarding#store-to-load-forwarding . This seems to contradict "Reads may be reordered with older writes to different locations but not with older writes to the same location." – nvn Apr 07 '21 at 04:02
  • https://stackoverflow.com/questions/24176599/what-does-store-buffer-forwarding-mean-in-the-intel-developers-manual – nvn Apr 07 '21 at 04:05
  • 1
    @nvn: That Intel wording of "not with older writes to the same location" has come up before. It's very misleading; the only interpretation of it that's consistent with observed reality (and other formal on-paper x86 memory models) is that it's just stating the obvious fact that a thread can see *its own* recent stores, i.e. you read the value from your own stores. It's *not* saying anything about what *other* cores will observe. (As @Bee commented on [Can x86 reorder a narrow store with a wider load that fully contains it?](https://stackoverflow.com/posts/comments/75186763)) – Peter Cordes Apr 07 '21 at 04:34
  • 1
    @nvn: See also [Globally Invisible load instructions](https://stackoverflow.com/q/50609934) for a more general discussion of exactly where load data can come from (and when). The x86 memory model (for "normal" (not NT) loads/stores on normal (WB) memory) can be summarized (exactly AFAIK) as program-order plus a store buffer with store forwarding. – Peter Cordes Apr 07 '21 at 04:37
  • Store forwarding works if a write to memory is followed by a read from the same address when the read has the same operand size, which means if it weren't **fetch_add** in my code but a plain store, it could have been reordered? https://easyperf.net/blog/2018/03/09/Store-forwarding#store-to-load-forwarding – nvn Apr 07 '21 at 07:01
  • 1
    @nvn: Yes, each thread will normally just reload its own `1`, only rarely seeing the `1` stored by the other thread. But what do you mean "re"ordered? Even if you have foo and bar store separate values, you don't know which ran first. So your experiment can't detect a "later" read reading the value from the first write (which will get overwritten). – Peter Cordes Apr 07 '21 at 07:08
  • 1
    @nvn: Thanks for that link, though. I'd always assumed that multiple slow-path store forwardings could be in flight (a store forwarding stall doesn't wait for the store to commit to cache), but I'd never tried the experiment. It seems that might not be the case, so the name is more appropriate than I thought. It may stall the store-forwarding machinery until this store-forwarding completes. It'd be worth trying to see if successful fast-path forwarding can happen in the shadow of a stall. Of course that performance issue isn't really relevant to memory-ordering. – Peter Cordes Apr 07 '21 at 07:12