Coherence protocol and store buffer

Question

Consider the code below:

std::atomic<int> a = 100;
---
CPU 0:
a.store(101, std::memory_order_relaxed);
---
CPU 1:
int tmp = a.load(std::memory_order_relaxed);  // Assume `tmp` is 101.

Let's assume that CPU 0's store to a happens earlier in time than CPU 1's load of a (whether or not the load is reordered). In this scenario, tmp will be 101 instead of 100.

If the MOESI coherence protocol is used, then when CPU 0 stores to a, CPU 0 acquires the cache line in modified (M) mode. The store goes to CPU 0's store buffer. If CPU 1 had the cache line in its own cache, then its copy of the cache line transitions to invalid (I) mode.

When CPU 1 loads a, the cache line is transitioned to shared (S) mode (or maybe owned (O) mode).

Assume that a is still in CPU 0's store buffer when CPU 1 loads a. Given that CPU 1 cannot read CPU 0's store buffer, then when CPU 1 reads the cache line with a, does this imply that CPU 0's store buffer is flushed (or at least, the cache line with a is flushed from CPU 0's store buffer)?

If the flush did not happen, then CPU 0 and CPU 1 would both have the cache line in shared (S) mode, yet CPU 0 would see a with the value 101 while CPU 1 would see a with the value 100.

Note: I am asking about MOESI, although each microarchitecture implements its own coherence protocol. I would imagine that most microarchitectures handle this concern similarly.

@RichardCritten Hi Richard, thanks for sharing the link. I am curious about how this specific scenario is handled by the coherence protocol and the store buffer, assuming the ordering just happens to work out the way I describe in the question (even though it won't always work out that way). — Jack Humphries, Sep 01 '23 at 21:53
A store only happens (from the perspective of cache coherency protocols) when it *leaves* the store buffer — harold, Sep 01 '23 at 21:53
@harold Hi Harold, thanks for your response. If CPU 0 has stored to `a` and then the coherence protocol wants to move the cache line to shared mode, where does the store go? If it stays in the store buffer, then CPU 0 will see one value in its store buffer while CPU 1 sees another value for the cache line from the cache hierarchy, even though both CPUs have the cache line in shared mode. — Jack Humphries, Sep 01 '23 at 21:55
As far as I know that's just a natural consequence of the TSO memory model — harold, Sep 01 '23 at 22:39

score 1 · Accepted Answer · answered Sep 02 '23 at 00:20

Store buffers aren't snooped by loads from other cores; they're private. Stores become globally visible when they commit from the store buffer to L1d cache. (The core has to get MESI exclusive ownership of the line before it can do that, E or M state.)
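To make that concrete, here is a minimal toy model in C++. It is a sketch of the idea only, not how any real core is built: `ToyCore`, `CoherentMemory`, and `demo_question_scenario` are hypothetical names invented for illustration, and the model deliberately leaves out the MESI ownership step (a real core must hold the line in E or M state before the commit).

```cpp
#include <cassert>
#include <deque>
#include <unordered_map>
#include <utility>

// Toy model only. "CoherentMemory" stands in for the coherent cache hierarchy
// that all cores agree on. All names here are hypothetical.
using Addr = int;
using CoherentMemory = std::unordered_map<Addr, int>;

class ToyCore {
public:
    explicit ToyCore(CoherentMemory& mem) : mem_(mem) {}

    // A store first lands in this core's private store buffer.
    // Nothing outside this core can observe it yet.
    void store(Addr addr, int value) { buffer_.emplace_back(addr, value); }

    // A load checks this core's own buffer first (store forwarding; the
    // youngest matching entry wins), then falls back to coherent memory.
    int load(Addr addr) const {
        for (auto it = buffer_.rbegin(); it != buffer_.rend(); ++it) {
            if (it->first == addr) return it->second;
        }
        return mem_.at(addr);
    }

    // Commit the oldest buffered store to coherent memory. Only at this
    // point does the store become visible to other cores.
    // Returns false if the buffer was empty.
    bool commit_oldest() {
        if (buffer_.empty()) return false;
        mem_[buffer_.front().first] = buffer_.front().second;
        buffer_.pop_front();
        return true;
    }

private:
    CoherentMemory& mem_;
    std::deque<std::pair<Addr, int>> buffer_;  // oldest entry at the front
};

// The scenario from the question, with address 0 playing the role of `a`.
void demo_question_scenario() {
    CoherentMemory mem{{0, 100}};
    ToyCore cpu0(mem), cpu1(mem);

    cpu0.store(0, 101);           // sits in CPU 0's store buffer
    assert(cpu0.load(0) == 101);  // CPU 0 sees its own store via forwarding
    assert(cpu1.load(0) == 100);  // CPU 1 still reads the old coherent value

    cpu0.commit_oldest();         // the store is now globally visible
    assert(cpu1.load(0) == 101);
}
```

In this model, CPU 1's load never forces CPU 0's buffer to drain. CPU 1 simply reads whatever the coherent copy holds, and the pending store commits later, once CPU 0 owns the line.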

This has to wait until after the store instruction has graduated, aka retired from the ROB (ReOrder Buffer), so it's known to be non-speculative. A store buffer is necessary to allow speculative execution of stores: it keeps private to this core the speculative state that might need to be rolled back if mis-speculation is detected (e.g. a branch mispredict or a fault in an earlier instruction).

A core can see its own stores (via store forwarding) before they become globally visible (to any other cores). This "reordering" is somewhat separate from the usual StoreLoad reordering introduced by a store buffer when later loads are to different addresses. See also Globally Invisible load instructions for some discussion of it. (And fun corner cases like a load that partially overlaps with a store seeing a value that no other core could ever have seen.)

x86's TSO memory model is program order with a store buffer + store forwarding¹ for each core's accesses to coherent shared cache. (See Preshing's analogy, Memory Barriers Are Like Source Control Operations.) It's important to mention store-forwarding, because it can produce effects you wouldn't see if a load that "hit" an address already in the store buffer just stalled until the store buffer committed to cache.
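The StoreLoad effect of a store buffer is usually demonstrated with the classic "store buffer" litmus test. The sketch below is one way to write it in C++; `count_both_zero` is a hypothetical helper name, and spawning two threads per iteration is a simplification that makes the reordering hard to catch in practice. With relaxed ordering, the outcome where both loads return 0 is allowed (each store can still be sitting in its core's store buffer when the other core loads). With `std::memory_order_seq_cst` on every operation, the C++ memory model forbids that outcome.

```cpp
#include <atomic>
#include <thread>

// Runs the store-buffer litmus test `iterations` times and returns how many
// runs ended with both threads having loaded 0.
// Hypothetical helper for illustration; not a rigorous hardware test harness.
int count_both_zero(std::memory_order store_mo,
                    std::memory_order load_mo,
                    int iterations) {
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r0 = -1, r1 = -1;

        // Each thread stores to one variable, then loads the other.
        std::thread t0([&] { x.store(1, store_mo); r0 = y.load(load_mo); });
        std::thread t1([&] { y.store(1, store_mo); r1 = x.load(load_mo); });
        t0.join();
        t1.join();

        // Both loads seeing 0 means each load took effect before the other
        // thread's store became globally visible.
        if (r0 == 0 && r1 == 0) ++both_zero;
    }
    return both_zero;
}
```

Calling `count_both_zero(std::memory_order_seq_cst, std::memory_order_seq_cst, n)` must return 0 for any `n`. With `std::memory_order_relaxed` for both arguments, a nonzero count is permitted but not guaranteed to show up on any given run.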

A cache line has to be exclusively owned before the store can commit to L1d (and become globally visible), but store forwarding to this core's own loads can happen without that.

(On most architectures, commit to L1d and MESI coherency is the only way for a store to become visible outside the current core at all. But PowerPC allows forwarding "graduated" stores to the other logical SMT cores, making IRIW reordering possible.)

Footnote 1: This is what 486 or P5 Pentium "naturally" did, with in-order pipelines and a store buffer, before an x86 memory model was really documented. P6 took pains not to introduce any new memory-reordering to avoid breaking existing multi-threaded code. It speculatively loads early, but rolls back with a memory-order mis-speculation pipeline nuke if it detects that the cache line has been invalidated between when it actually loaded and when it's architecturally allowed to load.

This is a great answer, thanks so much Peter! I also came across this great answer from John McCalpin, which is in line with everything you wrote: https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Draining-store-buffer-on-other-core/m-p/1040738 — Jack Humphries, Sep 02 '23 at 06:52
Peter, is there any way I can email you? Would love to set up a "coffee" chat if you're free! Would also love to have you come speak to the systems PhD students and faculty at Stanford if you want (but is no way required). You've been so amazing over the years :) — Jack Humphries, Sep 02 '23 at 06:57
@JackHumphries: I'm at pcordes@gmail.com. IDK what kind of topic I might speak on; I'm just an interested amateur for the most part :P (but have been interested for enough years to have picked up quite a bit of detail, and do some freelance performance tuning / SIMD stuff. :) (And yeah, John "Dr. Bandwidth" McCalpin definitely knows what he's talking about; you can trust his posts to have correct info about computer architecture stuff, including low-level details that are the subject of misconceptions by many people.) — Peter Cordes, Sep 02 '23 at 07:01
