TL;DR: yes, the barrier concerns the whole thread/program, regardless of any function calls.
I feel like you might be mixing two things.
Let's have two threads each execute some sequence of read and write instructions, somehow interleaved.
Then for the same address `A`, value `'X'`, and the instructions `write(A, 'X'); y = read(A)`, there are basically two cases according to the C++ memory model:
- a) If both instructions execute on the same thread, `read` is guaranteed to return `'X'`, i.e. `y == 'X'`.
- b) If the instructions execute on different threads, there are no guarantees; it is undefined behaviour unless synchronized explicitly through some synchronization primitive.
In other words, how the compiler generates the sequence of instructions is mostly irrelevant to you: it either just works, or you should not be doing it at all.
The compiler can reorder both C++ statements and the corresponding CPU instructions as it sees fit, as long as the observable result is the same as that of sequential execution under the C++ rules for evaluating expressions and statements. As long as you cannot observe the difference, the compiler can do almost anything it wants.
Of course, the compiler can never reorder across what it cannot see into, because that code might have well-defined observable side effects. Therefore calls to virtual functions, calls across TUs without `-flto`, and calls into shared libraries are not reordered. But relying on this for visibility across threads is still undefined behaviour.
All of that happens inside the C++ abstract machine; none of it gives you any guarantees about which CPU instructions are actually executed.
Furthermore, C++ explicitly makes no promises about how the sequence of CPU instructions is observable from any other thread (or from the outside world, for that matter) unless explicitly synchronized. If the compiler observes that writing to some memory location is redundant because the thread/program itself cannot tell the difference, it does not have to write anything. For example:
```cpp
int* ptr = ...;
*ptr = 42;
int x = *ptr;
// can be replaced with just the following, and thus no memory is written to at all:
int x = 42;
```
You are not saying "write 42 to memory"; you are saying the program must behave *as if* you had written it to memory, and unless `*ptr` is synchronized across threads, the compiler will not care about other threads' accesses to it at all.
Going on, the C++ memory model operates by default on a per-thread basis, with only a specific set of primitives (atomics, locks, barriers...) that may be accessed from multiple threads. Only for them is access synchronized, and therefore it is only for them that the visibility of CPU read/write instructions plays any role at all, and it is only around them that the visibility of the effects of all other instructions is defined.
The details are on cppreference, but the idea is that accesses to the shared primitive can be used to constrain how the executed CPU instructions are observed by other threads. Operations on the shared primitive force the compiler to constrain the reordering of the generated CPU instructions to the rules of C++ evaluation order.
For example, take the following shared variables

```cpp
int x = 0;
std::atomic<bool> a{false};
```

and two functions called in parallel and executed in the commented order
```cpp
void thread1() {
    x = 5;                                     // 1
    a.store(true, std::memory_order_seq_cst);  // 2
}

void thread2() {
    a.load(std::memory_order_seq_cst);         // 3
    int y = x;                                 // 4
}
```
then `y == 5`.
- Step 2 guarantees that any thread that later reads `true` from `a` will also observe `x == 5`. Among other things, this prevents the compiler from exchanging steps 1 and 2 (a compiler barrier, at least).
- Step 3 ensures that all those writes/reads that happened before step 2 are actually visible to `thread2` (a CPU sync of caches, or whatever is necessary). It also prevents reordering step 4 before step 3.
Just be careful: the memory model does not constrain execution order, only the visibility of the chosen execution order. For the former, you need locks or explicit (non-memory) barriers. If step 3 happens to run before step 2, the load returns `false` without synchronizing with the store, and step 4 is still undefined behaviour (an unsynchronized read racing with the write in step 1).