How does the memory controller guarantee memory ordering of atomics when propagating cachelines?

Question

I'm currently taking a deep look at std::atomics and the C++ memory model. What really helped my mental model is the concept of the store and load buffer of the CPU, which is basically a fifo queue for data that has to be written or read to/from the L1 cache which is present in Intel architectures at least. I understand that atomic operations are basically instructions to the CPU which prevent the wrapped type from tearing AND reordering write or read instructions across the barrier at compiletime or at runtime. To illustrate the gap in my mental model I quickly came up this example:

Demo

#include <atomic>
#include <iostream>
#include <thread>

int a;
int b;
int c;
std::atomic<int> x;
int e = 0;

auto thread1() {
    while(1) {
        a = 3;
        b = 5;
        c = 1;
        x.store(10, std::memory_order::release);
        e++;
        std::cout << "stored!" << std::endl;
    }
}

auto thread2() {
    while(1) {
        x.load(std::memory_order::acquire);
        std::cout << b << std::endl;
    }
}


int main() {
    [[maybe_unused]] auto t1 = std::thread(&thread1);
    [[maybe_unused]] auto t2 = std::thread(&thread2);
}

Here, one thread writes to the global variables a,b,c, the atomic variable x and normal variable e (read before increment) while the other thread reads from the atomic variable x and normal variable b. Assume for the next part that both threads actually run on different CPU cores. Also keep in mind that this simple example completely ignores contention synchronisation and only serves a static example.

Now this is my mental model of the process:

As you can see the store buffer feeds data in an ordered manner into the L1 cache. The data then propagates through the L2 and L3 caches to main memory. Nobody knows when it will arrive there though, but it will arrive in full cachelines of 64 Kb (on most architectures). Let's assume now that the global variables a,b,c happen to be placed on a different cachelines than x and e. Which sparked my question: How will the memory controller know to propogate the two cachelines such that the memory ordering implied by the atomic operation on x is respected?

What I mean is that if cacheline 1) happens to arrive in main memory before CL 2), everything is fine, the newly written values of a,b and c are visible to other thread before the store of x. But what if the reverse happens? If cacheline 2) propogates first, the write to x and possibly e will be visible BEFORE the write to a,b and c which would result in an invalid memory ordering. This must be prevented somehow. I figured out a few possible solutions:

The memory controller will propagate cachelines always in the same order as they're updated in L1. As CL 2) is updated after 1) it will push 1) to main first before 2) and the constraints are satisfied.
The memory controller somehow "knows" about the ordering relationships of the cachelines and basically keeps a mental note of which CL to propagate first through the caches.

There might be other solutions I can't think of right now, but I think understanding this puzzle piece would help me complete my mental understanding to an acceptable amount of detail. Also please correct me if my understanding is somehow flawed.

I don't know how the CPU actually does it, but an atomic access inside a cache line marks that entire line as dirty to the rest to the cores forcing other CPU cores to update the cache line to get the updated value. — NathanOliver, Apr 10 '23 at 17:29
@NathanOliver Interesting piece of information. Building on that one could assume that all cachelines containing happens-before data would have to be marked dirty. If this is the case then the order would be automatically enforced as the reloading of the new cachelines would bring along the new values — glades, Apr 10 '23 at 18:11
Like Nathan, most of us don't know (or care :-) about the nasty details, but at least on an x86 system there is a lot more communication between cores and caches than just updating main memory. When a cache line becomes dirty, the other caches will be made aware of that - through some magic. It's probably all on the same chip, right? — BoP, Apr 10 '23 at 18:16
You should probably make each variable volatile, or document that you compile with no optimization, which is the same. — curiousguy, Apr 11 '23 at 16:01
edited my answer to add another example of a way ISO C++ is weaker than most real hardware that does work like this model: release sequences are only continued by RMWs, not pure stores. — Peter Cordes, Apr 12 '23 at 21:30

Peter Cordes · Accepted Answer · 2023-04-12T21:30:05.120

Nobody knows when it will arrive there though

Inner caches participate in the cache-coherency protocol. AFAIK, all modern CPUs use some variation of MESI. (The wikipedia article describes it in terms of processors snooping a shared bus, but actual CPUs use a "directory", e.g. Intel CPUs with an inclusive L3 cache use L3 tags to keep track of which core might have a modified copy of a cache line. Skylake-Xeon and later have a separate coherency directory and snoop filter since their L3 cache is NINE (not inclusive, not exclusive).)

Before a store can commit from the store buffer to a cache-line, this core has to get exclusive ownership of it. If it wasn't already in Modified or Exclusive state, it needs to do a Read For Ownership which invalidates any copies in other cores, and wait for a response to acknowledge that it has ownership. Only then can it commit the store, making it globally visible at that point. (A share request or RFO from another core could come in and see the value.)

With a coherent cache, memory reordering is only local (within each CPU core, ordering of its loads and stores to coherent cache). e.g. the store buffer delays loads and out-of-order exec (or just in-order with a hit-under-miss cache) does loads early and possibly out-of-order.

The interconnect between cores may have to take some care to not introduce new memory reordering, but mostly it Just Works by maintaining coherency. In practice the hardware does maintain ordering.

So memory barrier instructions just have to make later memory operations wait for some earlier things to complete, e.g. for the store buffer to drain if it's a full barrier like x86 mfence. Or not if it's just an acq_rel fence (https://preshing.com/20130922/acquire-and-release-fences/).

Note that atomic operations like store(val, release) only have one-way ordering, they can reorder with later stores and loads, but not earlier. (https://preshing.com/20120913/acquire-and-release-semantics/). Fences are different from operations, and stronger, needing to be 2-way.

x86's strongly-ordered memory model has to preserve the illusion of program order (plus a store buffer with store-forwarding which can lead to StoreLoad reordering), but weakly-ordered ISAs are allowed to create LoadLoad, StoreStore, and even LoadStore reordering. (https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/) In that case the load and store buffers aren't FIFOs.

Even x86 CPUs do speculatively load out of order, and check later that the cache line is still owned once the load was architecturally allowed to happen, probably at retirement. (This is why x86 has machine_clears.memory_order pipeline nukes when another core is modifying data we're loading.)

Shared cache is the backstop for coherency (L3 in this case); data doesn't have to go all the way to DRAM. Even on a multi-socket Xeon, or a big Zen with multiple L3 clusters, they can communicate with each other and pass cache lines to one another without a slow round-trip to DRAM as well.

In general, the C++ memory model as defined by the standard is a lot weaker than anything you can explain in terms of simple cache-coherent hardware. e.g. IRIW reordering is only possible in real life on a few machines, such as POWER, with store-forwarding between logical cores on the same physical core. See Will two atomic writes to different locations in different threads always be seen in the same order by other threads?

Another good example is that only atomic RMWs continue a release-sequence, so an acquire load only establishes a happens-before relationship with the release-store whose value it loads, not with earlier release stores in the modification order. But in a machine where stores can only become visible to other cores via commit to a coherent cache, pure stores do also have to wait for ownership of the cache line, so in terms of cache coherency are similar to RMWs. So the coherent-cache model can't explain such effects allowed on paper by the ISO C++ standard.

In early drafts of the C++ standard, it was going to guarantee that even pure stores continue a release-sequence, but this got changed. Perhaps because of some real hardware where that's not the case, like maybe PowerPC, or at least some ISAs that don't guarantee something on paper.

Don't fall into the trap of thinking that something might be guaranteed in the ISO standard's formalism just because you can't think of a hardware mechanism that can explain a reordering that violates it. (But thinking in hardware terms is useful in the other direction: if you know how C++ atomics compile to asm for AArch64 for example, and can think of a way that a reordering is possible on AArch64, it must not be guaranteed by ISO C++. It's exceedingly unlikely that there's a bug in the mappings.)

On most modern CPUs, a cache line is 64 bytes (64 B), not 64 kilobits (64 Kb = 8KB).

How does the memory controller guarantee memory ordering of atomics when propagating cachelines?

1 Answers1

Linked