
I have a question about the x86 TSO memory consistency model. Starting from 'A Primer on Memory Consistency and Cache Coherence', it seems to me that the 'global store order' (i.e. the global order of stores in memory order) could be different from the point of view of the cores involved.

Suppose you have 2 cores each with a FIFO write buffer as in the following model:

[figure: two Cores, each with a FIFO write buffer, connected to a shared memory through a 'memory switch']

Each Core issues/inserts stores into its write buffer in program order, and I believe the global store order is basically defined by the order in which the 'memory switch' selects Cores to 'drain' their write buffers.

Now, adding store-to-load forwarding (loads that bypass the latest stores in program order on each Core), the sequence of stores that the program running on one Core observes can be different from what is seen by the program running on the other Core (even though the subset of stores issued by the program running on each Core is still in its program order).
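To make the model concrete, here is a minimal toy sketch of it (my own illustration, not real hardware): two Cores with FIFO write buffers, a shared memory, and store-to-load forwarding. It shows a Core observing its own store before that store is globally visible:

```python
from collections import deque

memory = {}  # globally visible state, updated only by the 'memory switch'

class Core:
    def __init__(self):
        self.write_buffer = deque()  # FIFO write buffer: (addr, value)

    def store(self, addr, value):
        # stores enter the write buffer in program order
        self.write_buffer.append((addr, value))

    def load(self, addr):
        # store-to-load forwarding: the newest buffered store to addr wins
        for a, v in reversed(self.write_buffer):
            if a == addr:
                return v
        return memory.get(addr, 0)  # otherwise read globally visible memory

def switch_drain_one(core):
    # the 'memory switch' commits the core's oldest buffered store
    addr, value = core.write_buffer.popleft()
    memory[addr] = value

c0, c1 = Core(), Core()
c0.store('x', 1)
print(c0.load('x'))  # 1: c0 sees its own store via forwarding
print(c1.load('x'))  # 0: the store is not globally visible yet
switch_drain_one(c0)
print(c1.load('x'))  # 1: now it is
```

The drain order chosen by the switch defines the global store order; forwarding gives each Core an early, private view of its own stores.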

Maybe I'm missing some point...

  • You're correct, the store buffer lets a core see its own stores in an order different from the global-visibility order, unless it uses `mfence` or something else that drains the store buffer before loading. My answer on [Globally Invisible load instructions](https://stackoverflow.com/a/50617986) covers some of the same territory, but that question is asking about similar subject matter from a different angle so IDK if it's a duplicate. – Peter Cordes Nov 11 '21 at 09:00
  • So let me say: the name 'Total Store Order (TSO)' actually refers to the unique global order in which stores from all cores are seen by a hypothetical agent attached to the memory? – Carlo C Nov 11 '21 at 10:03
  • No, not DRAM, the eviction order from write-back caches can be totally different. But yes, a hypothetical agent, such as another core that's part of the MESI coherency domain but doesn't do any stores of its own. The global order is relevant for things like the IRIW litmus test: two readers reading in opposite order on locations written to by two other independent writers: the readers will never disagree on what order the stores happened. [Will two atomic writes to different locations in different threads always be seen in the same order by other threads?](//stackoverflow.com/a/50679223) – Peter Cordes Nov 11 '21 at 10:08
  • The other part of TSO is that the global order is some interleaving of program order on each core. But reloading your own stores is always special. They won't be visible early to any *other* core, but they are to yourself. See also [Can x86 reorder a narrow store with a wider load that fully contains it?](https://stackoverflow.com/posts/comments/75186763) for some discussion of the way Intel explains this in their manuals. – Peter Cordes Nov 11 '21 at 10:12
  • Oh, then yes, as observed by memory, or by the coherent cache. Loads that take data from store-forwarding aren't reading from coherent cache. – Peter Cordes Nov 11 '21 at 10:44
  • @PeterCordes as a reply to your previous comment, I was thinking of a simple 'switch memory model' without any cache coherence system (no MESI or other cache coherency protocol). W.r.t. your last comment, you said that the global store order is some interleaving of program order on each core. But couldn't such an interleaving be different for each core (given the store-forwarding feature that allows the reloading of its own stores)? – Carlo C Nov 11 '21 at 10:48
  • Loads that come from store-forwarding aren't reading from the *global* order at all (the store isn't globally visible yet; we don't yet know how it will interleave with others). They're reading a private view of memory that only this core sees. See [Globally Invisible load instructions](https://stackoverflow.com/a/50617986) – Peter Cordes Nov 11 '21 at 11:22
  • ok, so restricting to just globally visible loads (i.e. ruling out the loads that come from store-forwarding), they see the same unique global store order regardless of the Core from which they are issued, right? – Carlo C Nov 11 '21 at 12:02
  • Yes, exactly. That's true on almost all systems; in practice only some POWER CPUs violate it. TSO means that global-visibility order is some interleaving of program order. – Peter Cordes Nov 11 '21 at 12:09
  • As far as I can understand, if we extend this to include *all* loads, then the requirement that a unique global-visibility order is some interleaving of program order on each core leads to Sequential Consistency (SC). Thank you. – Carlo C Nov 11 '21 at 12:26
  • Not exactly. That requirement would mean that any store/reload is a full memory barrier (as discussed in [Can x86 reorder a narrow store with a wider load that fully contains it?](https://stackoverflow.com/a/35910141)), but it wouldn't rule out StoreLoad reordering when a thread stores to one location and then loads from another. As in https://preshing.com/20120515/memory-reordering-caught-in-the-act/ – Peter Cordes Nov 11 '21 at 12:37
  • Sorry, I cannot move this discussion to a chat due to my low reputation. Maybe I was unclear: SC rules out any write buffers, so there is no room for *reloading* from store-forwarding at all (all loads & stores from all cores are always globally visible/done). So the requirement that the unique globally visible order of stores & loads is some interleaving of program order actually defines SC consistency. – Carlo C Nov 11 '21 at 13:33
  • Oh, yes if there's a total order of all *loads* and stores that's an interleaving of program order, yes, that's the definition of SC. And yes it means draining the store buffer before the next load from shared memory. It doesn't fully make sense to say a load is "globally visible", only to talk about what values it was allowed to load, i.e. whether it can only load things that some other thread could have loaded. e.g. if loads can only ever read from coherent cache. – Peter Cordes Nov 11 '21 at 13:41
  • Fun fact: AArch64's LDAR / STLR can achieve SC *for data-race-free programs* (i.e. when compiling C++ using std::atomic) without flushing the store buffer after every STLR store. Only making sure it's drained before the next LDAR reads a value. So it can still reorder with later non-atomic loads (and stores because AArch64's memory model isn't TSO). So "an SC store" on AArch64 doesn't need to flush the store buffer on the spot to be able to recover sequential consistency, as long as nobody looks at non-atomic variables without synchronization. – Peter Cordes Nov 11 '21 at 13:44

3 Answers


Modern CPUs have coherent caches. A coherent cache is implemented by a cache coherence protocol. When a store wants to commit to the coherent cache, it first needs to wait for the cache line to be invalidated in all other CPUs' caches. Any other CPU that accesses the line after it has been invalidated will not see an older version.

Writing to a cache line is linearizable because the effect of writing to a cache line takes place between the start of the write and completion of the write (its linearization point is between these 2 events). Reading from the cache is linearizable as well since its linearization point is also between start and completion.

Because accessing a cache line is linearizable and linearizability is composable, the cache (the set of all cache lines) is linearizable. Because the cache is linearizable, there is guaranteed to be a total order over all reads and writes from the cache.

This total order might or might not be consistent with the program order; this is not a concern of the cache. E.g. out-of-order execution, or committing stores from the store buffer into the cache out of order, might break program order. But at least nobody will be able to disagree on the order in which loads/stores are done on the cache.

The problems that prevent having a total order on the memory order are caused by the layers above the cache. For example, the store buffer of one CPU might be shared with a strict subset of the CPUs (e.g. PowerPC). This way some CPUs can see stores early and others can't. Another cause is STLF (Store To Load Forwarding, aka store forwarding).


How do TSO and STLF fit into the picture?

Because older stores can be reordered with newer loads to different addresses, due to store buffers, we need to give up the [StoreLoad] ordering within memory order. But this doesn't mean we can't have a total order over all loads/stores; it just means we need to relax the program order restrictions within the memory order.
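This relaxation is the classic StoreLoad litmus test. Here is a toy store-buffer sketch of it (my own illustration, not part of the original answer): both stores still sit in their buffers while both loads read memory, so r1 == r2 == 0 is observable, an outcome TSO permits but SC forbids:

```python
from collections import deque

memory = {'x': 0, 'y': 0}  # globally visible state

def run_cpu(store_addr, load_addr):
    buffer = deque()
    buffer.append((store_addr, 1))   # the store waits in the store buffer
    for a, v in reversed(buffer):    # forwarding check (different address here)
        if a == load_addr:
            return v
    return memory[load_addr]         # the load reads memory before any drain

r1 = run_cpu('x', 'y')  # CPU1: x = 1; r1 = y
r2 = run_cpu('y', 'x')  # CPU2: y = 1; r2 = x
print(r1, r2)           # 0 0 — StoreLoad reordering in effect
```

Once both buffers eventually drain, the stores still enter memory in a single total order; only the loads ran "too early" relative to it.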

The tricky part is STLF, because with STLF there can be disagreement on how a load is ordered with respect to the store it reads from. Example:

int a=0, b=0

CPU1:
   a=1    (1)
   r1=a   (2)
   r2=b   (3)

CPU2:
   b=1    (4)
   r3=b   (5)
   r4=a   (6)

Condition:
    r1=1,r2=0,r3=1,r4=0

The condition is possible with TSO. The load 'r1=a' can only be ordered after the store 'a=1' (a load can't be ordered in the memory order before the store it reads from).

But 'r1=a' is also ordered before 'a=1' due to STLF: it takes its value from the store buffer before 'a=1' is inserted into the memory order.

So we end up with a cycle in the memory order. To resolve this problem, the load 'r1=a' and the store 'a=1' are not ordered by the memory order. As a consequence, we are just left with a total order over the stores.
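The outcome above can be reproduced in a toy store-buffer model (my own sketch, not part of the original answer): each CPU forwards its own buffered store to its first load, while its second load reads memory before either buffer drains:

```python
from collections import deque

memory = {'a': 0, 'b': 0}  # globally visible state

def run_cpu(own_addr, other_addr):
    buffer = deque([(own_addr, 1)])    # the store sits in the store buffer
    def load(addr):
        for a, v in reversed(buffer):  # store-to-load forwarding
            if a == addr:
                return v
        return memory[addr]            # otherwise read globally visible memory
    return load(own_addr), load(other_addr)

r1, r2 = run_cpu('a', 'b')  # CPU1: a = 1; r1 = a; r2 = b
r3, r4 = run_cpu('b', 'a')  # CPU2: b = 1; r3 = b; r4 = a
print(r1, r2, r3, r4)       # 1 0 1 0 — allowed by TSO with STLF
```

Each forwarded load (r1, r3) reads a store that is not yet in the memory order, which is exactly why those loads must be left out of it.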

The book "A Primer on Memory Consistency and Cache Coherence" is a superb book. A lot of my understanding of this topic, including the linearizability argument and the treatment of the STLF cycle, comes from it (I also had a few message exchanges with one of the authors).

  • What does it mean that a cache access is linearizable? Is it somehow related to the atomicity of accesses as seen from all CPU cores? – Carlo C Feb 27 '22 at 11:05

Starting from the comments received, I will try to answer my own question.

Let's assume Cores execute load & store memory operations in program order in the model above. Each Core inserts its own stores into its FIFO write buffer (store queue).

When a load executes there are 2 cases:

  1. there is at least one store to the same address as the load waiting inside the write buffer (store queue): store-to-load forwarding takes place, the load is satisfied from the store queue, and it does not hit the memory system

  2. there is no store to the same address as the load waiting inside the store queue: the load has to be satisfied from the memory system, so the Core actually stalls, waiting for the 'memory switch' to select it to serve the load and drain the pending stores inside the store queue

This way TSO memory consistency is satisfied, and its global order is defined by the sequence in which stores from all Cores become globally visible (i.e. are drained from the Cores' store queues into memory according to the sequence defined by the 'memory switch').
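The two cases can be sketched as follows (a toy illustration of the model above, not real hardware):

```python
from collections import deque

memory = {}  # globally visible state

class Core:
    def __init__(self):
        self.store_queue = deque()  # FIFO write buffer

    def store(self, addr, value):
        self.store_queue.append((addr, value))  # program order

    def load(self, addr):
        # case 1: a matching store waits in the queue -> forward it,
        # without touching the memory system
        for a, v in reversed(self.store_queue):
            if a == addr:
                return v
        # case 2: no match -> the Core 'stalls' until the switch selects
        # it, drains its pending stores, and then serves the load
        while self.store_queue:
            a, v = self.store_queue.popleft()
            memory[a] = v
        return memory.get(addr, 0)
```

Note that draining before every non-forwarded load, as described in case 2, is stricter than TSO requires (such a load could read memory while older stores still wait in the queue), but the resulting behaviour still satisfies TSO.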


How does the x86 TSO memory consistency model work when some of the stores being observed come from store-forwarding?

80x86 itself is not TSO; it's "TSO with store forwarding", so plain TSO reasoning does not necessarily apply.

To enforce a TSO consistency model you may have to use explicit barriers (e.g. `mfence`) to prevent some store forwarding.

For a specific example, consider one CPU doing:

 C = -1;

 A = 0;
 B = 0;
 if(B == 1) C = A;

.. while another CPU does:

 A = 1;
 B = 1;

For TSO, it would be impossible for the result to be C == 0 - either the other CPU didn't do B = 1 fast enough and therefore C = -1, or the other CPU must have already done A = 1 and therefore C = 1.

For "TSO with store forwarding", the first CPU can see that the other CPU did B = 1 and then forward the old value of A from its own A = 0 store to the C = A; load, and therefore C == 0 is possible (when it should not be possible for TSO).

As a general rule: whenever a CPU's reads might not occur in the same order as its writes, you need a barrier to enforce TSO.

For my example, the first CPU writes A then B, but then reads B then A (which is the opposite order), so it needs a barrier to enforce TSO. It'd be like:

 C = -1;

 A = 0;
 B = 0;
 MFENCE;
 if(B == 1) C = A;
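In store-buffer terms, the MFENCE acts as a drain: it forces every buffered store to become globally visible before any later load, so a later load can no longer read a value that other CPUs cannot yet see. A toy sketch (my own illustration, not real hardware):

```python
from collections import deque

memory = {'A': 1, 'B': 1}   # assume another CPU's A = 1; B = 1 already committed
store_buffer = deque()      # this CPU's FIFO store buffer

def store(addr, value):
    store_buffer.append((addr, value))

def mfence():
    # the fence drains the buffer: every buffered store becomes
    # globally visible before any later load executes
    while store_buffer:
        addr, value = store_buffer.popleft()
        memory[addr] = value

def load(addr):
    for a, v in reversed(store_buffer):  # store-to-load forwarding
        if a == addr:
            return v
    return memory[addr]

store('A', 0)
print(load('A'), memory['A'])  # 0 1 : this CPU sees A=0, others still see A=1
mfence()
print(load('A'), memory['A'])  # 0 0 : after the drain, everyone agrees
```

Before the fence, the CPU's private view (via forwarding) and the globally visible view disagree; the drain removes the disagreement, which is what restores plain TSO reasoning for the loads that follow.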

Of course for most programs differences between TSO and "TSO with store forwarding" are extremely rare. Often the programming language (e.g. C) doesn't assume TSO in the first place, and often you have some kind of lock or mutex to enforce a specific order regardless of what the CPU does.

  • ok, so for simple TSO the purpose of the store buffer is just to avoid stalling the CPU core's execution, i.e. not wasting clock cycles while bringing the relevant local cache line into the appropriate state to commit the write. – Carlo C Feb 27 '22 at 08:38
  • @CarloC: Yes; the purpose of the store buffer is to speed things up in the majority of situations where (e.g.) it's single-threaded code, or the code is modifying "non-shared" data, or the code acquired a lock/mutex anyway. It does this by plucking values directly out of store buffer; which avoids the need to wait until the value is stored in cache (which can be a much slower "fetch the unmodified part of the cache line, then modify it, then write the whole cache line" operation). – Brendan Feb 27 '22 at 12:07
  • My understanding of TSO from "A Primer on Memory Consistency and Cache Coherence" actually includes store->load forwarding from the same address, hence TSO = "TSO with store forwarding", I think. – Carlo C Feb 27 '22 at 15:52