TL;DR: yes, the barrier concerns the whole thread/program, regardless of any function calls.
I feel like you might be mixing two things.
Let's have two threads each execute some sequence of read and write instructions, somehow interleaved.
Then for the same address `A`, value `'X'`, and the instructions `write(A, 'X'); y = read(A)`, there are basically two cases according to the C++ memory model:
- a) If both instructions execute on the same thread, `read` is guaranteed to return `'X'`, i.e. `y == 'X'`.
- b) If the instructions execute on different threads, there are no guarantees; it is undefined behaviour unless synchronized explicitly through some synchronization primitive.
In other words, how the compiler generates the sequence of instructions is mostly irrelevant to you: it either just works, or you should not be doing it at all.
The compiler can reorder both C++ statements and the corresponding CPU instructions as it sees fit, as long as the observable result is the same as that of sequential execution under the C++ rules for evaluating expressions and statements. As long as you cannot observe the difference, the compiler can do almost anything it wants.
Of course, the compiler can never reorder across what it cannot see into, because that code might have well-defined observable side effects. Therefore calls to virtual functions, calls across TUs without `-flto`, and calls into shared libraries are not reordered. But relying on this for visibility across threads is still undefined behaviour.
All of that happens inside the C++ abstract machine; none of it gives you any guarantees about which CPU instructions are actually executed.
Furthermore, C++ explicitly makes no promises about how the sequence of CPU instructions is observable from any other thread (or from the outside world, for that matter) unless explicitly synchronized. If the compiler observes that writing to some memory location is redundant because the thread/program itself cannot tell the difference, it does not have to write anything. For example:
```cpp
int* ptr = ...;
*ptr = 42;
int x = *ptr;
// can be replaced with just the following, and thus no memory is written to at all:
int x = 42;
```
You are not saying "write 42 to memory"; you are saying the program must behave *as if* you had written it to memory, and unless `*ptr` is synchronized across threads, the compiler will not care about other threads' accesses to it at all.
Going on, the C++ memory model operates by default on a per-thread basis, with only a specific set of primitives (atomics, locks, barriers...) that may be accessed from multiple threads. Only for them is access synchronized, and therefore it is only for them that the visibility of CPU read/write instructions plays any role at all, and it is only around them that the visibility of the effects of all other instructions is defined.
The details are on cppreference, but the idea is that accesses to the shared primitive can be used to constrain how the executed CPU instructions are observed by other threads. Operations on the shared primitive force the compiler to constrain the reordering of the generated CPU instructions to the rules of C++ evaluation order.
For example, take the following shared variables

```cpp
int x = 0;
std::atomic<bool> a{false};
```

and two functions called in parallel and executed in the commented order
```cpp
void thread1() {
    x = 5;                                     // 1
    a.store(true, std::memory_order_seq_cst);  // 2
}

void thread2() {
    a.load(std::memory_order_seq_cst);         // 3
    int y = x;                                 // 4
}
```
then `y == 5`.
- Step 2 guarantees that any thread that later reads `true` from `a` will also observe `x == 5`. Among other things, this prevents the compiler from exchanging steps 1 and 2 (a compiler barrier, at least).
- Step 3 ensures that all those writes/reads that happened before step 2 are actually visible to `thread2` (a CPU sync of caches, or whatever is necessary). It also prevents reordering step 4 before step 3.
Just be careful: the memory model does not constrain execution order, only the visibility of the chosen execution order. For the former, you need locks or explicit (non-memory) barriers. If step 3 happens to run before step 2, the load returns `false` without synchronizing with the store, and step 4 is still undefined behaviour (an unsynchronized read racing with the write in step 1).