I'm currently taking a deep look at std::atomic
and the C++ memory model. What really helped my mental model is the concept of the CPU's store and load buffers, which are essentially FIFO queues for data waiting to be written to or read from the L1 cache (present on Intel architectures at least). My understanding is that atomic operations are basically instructions to the CPU which prevent the wrapped type from tearing AND prevent write or read instructions from being reordered across the barrier, both at compile time and at runtime. To illustrate the gap in my mental model, I quickly came up with this example:
#include <atomic>
#include <iostream>
#include <thread>

int a;
int b;
int c;
std::atomic<int> x;
int e = 0;

auto thread1() {
    while (1) {
        // Plain stores, then a release store on x.
        a = 3;
        b = 5;
        c = 1;
        x.store(10, std::memory_order::release);
        e++;
        std::cout << "stored!" << std::endl;
    }
}

auto thread2() {
    while (1) {
        // Acquire load on x; the loaded value is deliberately discarded.
        x.load(std::memory_order::acquire);
        std::cout << b << std::endl;
    }
}

int main() {
    auto t1 = std::thread(&thread1);
    auto t2 = std::thread(&thread2);
    // Join so the std::thread destructors don't call std::terminate.
    t1.join();
    t2.join();
}
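(For reference, this should build on Linux with something like g++ -std=c++17 -O2 -pthread example.cpp; the flags are just what I'd expect to need, in particular -pthread for std::thread.)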
Here, one thread writes to the global variables a, b, c, the atomic variable x and the normal variable e (read, then increment), while the other thread reads from the atomic variable x and the normal variable b. Assume for the next part that the two threads actually run on different CPU cores. Also keep in mind that this simple example completely ignores contention and synchronisation and only serves as a static example.
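As an aside: I'm aware that to actually rely on the release/acquire pairing, the reader would normally check the value it loaded, since the acquire only synchronizes with the release store it actually observed. A minimal sketch of that pattern (my own variation on thread2, reusing x and b from the example above):

auto thread2_checked() {
    while (1) {
        // Only read b once the release store to x has been observed;
        // this is what establishes the happens-before edge with thread1.
        if (x.load(std::memory_order::acquire) == 10) {
            // thread1's writes to a, b, c that preceded the observed
            // release store are guaranteed to be visible here.
            std::cout << b << std::endl;
        }
    }
}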
Now this is my mental model of the process:
The store buffer feeds data in an ordered manner into the L1 cache. The data then propagates through the L2 and L3 caches to main memory. Nobody knows when it will arrive there, but it will arrive in full cachelines of 64 bytes (on most architectures). Let's assume now that the global variables a, b and c happen to be placed on a different cacheline than x and e; call the line holding a, b and c cacheline 1) and the line holding x and e cacheline 2). Which sparked my question: how will the memory controller know to propagate the two cachelines such that the memory ordering implied by the atomic operation on x is respected?
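To make that placement assumption concrete, it could be forced with alignment, something like this (my own sketch; the struct names are made up and 64 is just the usual x86 line size):

#include <atomic>

// Each struct is aligned and padded to one assumed 64-byte cacheline.
struct alignas(64) Line1 { int a, b, c; };                // cacheline 1)
struct alignas(64) Line2 { std::atomic<int> x; int e; };  // cacheline 2)

Line1 l1;
Line2 l2;
static_assert(sizeof(Line1) == 64 && sizeof(Line2) == 64);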
What I mean is that if cacheline 1) happens to arrive in main memory before CL 2), everything is fine: the newly written values of a, b and c become visible to the other thread before the store to x. But what if the reverse happens? If cacheline 2) propagates first, the writes to x and possibly e will be visible BEFORE the writes to a, b and c, which would result in an invalid memory ordering. This must be prevented somehow. I figured out a few possible solutions:
- The memory controller always propagates cachelines in the same order in which they were updated in L1. As CL 2) is updated after CL 1), it pushes 1) to main memory before 2) and the constraints are satisfied.
- The memory controller somehow "knows" about the ordering relationships between the cachelines and basically keeps a mental note of which CL to propagate first through the caches.
There might be other solutions I can't think of right now, but I believe understanding this puzzle piece would complete my mental model to an acceptable level of detail. Also, please correct me if my understanding is flawed somewhere.