
I've been studying the memory model and saw this (quote from https://research.swtch.com/hwmm):

Litmus Test: Write Queue (also called Store Buffer)
Can this program see r1 = 0, r2 = 0?
// Thread 1           // Thread 2
x = 1                 y = 1
r1 = y                r2 = x
On sequentially consistent hardware: no.
On x86 (or other TSO): yes!
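
For concreteness, this is how I'd write the same test in C++11 (my own translation, not from the article), using relaxed atomics so the compiler emits plain stores and loads on x86:

#include <atomic>

std::atomic<int> x{0}, y{0};
int r1, r2;

void thread1() {
    x.store(1, std::memory_order_relaxed);   // x = 1
    r1 = y.load(std::memory_order_relaxed);  // r1 = y
}

void thread2() {
    y.store(1, std::memory_order_relaxed);   // y = 1
    r2 = x.load(std::memory_order_relaxed);  // r2 = x
}

// After both threads finish, r1 == 0 && r2 == 0 is a permitted outcome on x86.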

  • Fact 1: This is the store buffer litmus test mentioned in many articles. They all say that both r1 and r2 being zero could happen on TSO because of the existence of the store buffer. They seem to assume that all the stores and loads are executed in order, and yet the result is both r1 and r2 being zero. They then conclude that "store/load reordering could happen" as a "consequence of the store buffer's existence".

  • Fact 2: However, we know that OoO execution could also reorder the store and the load in both threads. In this sense, regardless of the store buffer, this reordering could result in both r1 and r2 being zero, as long as all four instructions retire without seeing each other's invalidation of x or y. This suggests to me that "store/load reordering could happen" simply because "they are executed out of order". (I might be very wrong about this, since this is the limit of what I know about speculation and OoO execution.)

I wonder how these two facts converge (assuming I happen to be right about both): is the store buffer or OoO execution the reason for "store/load reordering", or are they both?

Alternatively speaking: Say I somehow observed this litmus test on an x86 machine, was it because of the store buffer, or OoO execution? Or is it even possible to know which?


EDIT: Actually, my major confusion is the unclear causality among the following points from various pieces of literature:

  1. OoO execution can cause the memory reordering;
  2. Store/load reordering is caused by the store buffer and demonstrated by a litmus test (which is thus named the "store buffer" test);
  3. A program with the exact same instructions as the store buffer litmus test is used as an observable example of OoO execution, as this article https://preshing.com/20120515/memory-reordering-caught-in-the-act does.

1 + 2 seems to imply that the store buffer is the cause, and OoO execution is the consequence. 3 + 1 seems to imply that OoO execution is the cause, and memory reordering is the consequence. I can no longer tell which causes which, and it is that litmus test sitting in the middle of this mystery.

zanmato
  • The title question was intended to be about why it's called that, right? Not why a StoreLoad reordering litmus test with that code is useful. (It's rarely relevant; usually acq/rel sync is enough for inter-thread communication, but as a way to specify what reordering effects are allowed it's essential. https://preshing.com/20120515/memory-reordering-caught-in-the-act/ is a real-world implementation of this litmus test for x86). I edited your title to clarify that, since that's the direction I'm going in the answer I'm writing. – Peter Cordes Sep 09 '21 at 05:15
  • I put some edits into the question body after I re-organized my thoughts. But yeah, your edit to the title is still the direction I'm looking for, plus your extensive answer below resolves all my confusion. So the edit surely LGTM. – zanmato Sep 09 '21 at 18:37
  • You wrote Preshing's test was used to detect "OoO execution". That's not exactly what he says. "*As a result, it might end up as though the instructions had executed in this order:*". He does use the word "executed", but I think he really means as if program-order had been the opposite. Or as if executed on a serial machine in the other order. He's *not* trying to suggest that "OoO exec" was the specific mechanism creating the reordering. – Peter Cordes Sep 09 '21 at 22:52
  • See https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/ where he explains how a "pipeline" can delay stores, and (at the bottom) that real CPUs have store buffers so the analogy only goes so far. – Peter Cordes Sep 09 '21 at 22:52
  • Right, I think I was wrong in my understanding of Preshing's test. It's merely saying that we can observe real-world "memory reordering", without attributing it to any particular cause (i.e., OoO execution or the store buffer). After all your explanation, I have more confidence about the litmus tests in the literature: they are all highly hypothetical programs with the IMPLICIT presumption (unless explicitly mentioned otherwise) that all the instructions are executed in order, to illustrate how a micro-architectural component could affect memory ordering. – zanmato Sep 10 '21 at 04:58
  • Cont: And when a particular litmus test (e.g. store buffer) becomes a real-world program on real-world hardware (e.g. x86), one must realize that this program may be insufficient (e.g. missing the fences necessary to prevent OoO) to illustrate what the literature describes (e.g. the store buffer letting a load go ahead of an older store), because another micro-architectural component (e.g. OoO execution) can introduce the same effect (e.g. a load ahead of an older store). This is something else I learned from this question. – zanmato Sep 10 '21 at 05:12
  • I don't think there's any implied *fence* in a normal description of a litmus test, at least not to exclude OoO exec. The litmus test just shows program order, and catalogues possible results allowed by the ISA on paper, regardless of how that happens in different microarchitectural implementations of the ISA. It turns out that there are mechanisms that can produce all the basic LoadStore / LoadLoad / etc. memory-reordering effects without OoO exec, but there's zero implication that a litmus test should be excluding that. – Peter Cordes Sep 10 '21 at 05:20
  • Stuff like IRIW implies that the two readers order their loads, which may need barriers, and yes that's something that gets glossed over sometimes. But again, the key point is to show you that if your real-world code boils down to this pattern of stores and loads, these are the allowed outcomes according to the rules on paper. (For some more obscure litmus tests, like IRIW on ARMv7, it was allowed on paper but no HW ever did it; that's why ARMv8 was able to strengthen the rules to make the ISA multi-copy atomic, i.e. guarantee that observers can't disagree on global order of stores.) – Peter Cordes Sep 10 '21 at 05:24
  • Let's take the store buffer litmus test as an example. Though the naming (yeah, we're back on it!) strongly implies that "it is the store buffer causing the result", what it actually says is that the store buffer could be one reason for the outcome, but not necessarily the only reason - that is: once you observe that outcome on real hardware, it could be solely an effect of something else (e.g. OoO exec)? – zanmato Sep 10 '21 at 08:17
  • Where else are you seeing it called a "store buffer" litmus test? That site you linked seems to have just made up its own names for them, somewhat arbitrarily. https://www.cl.cam.ac.uk/~pes20/weakmemory/x86tso-paper.pdf (a formal description of x86's memory model) names their litmus tests things like "Example 7-3. Loads May be Reordered with Older Stores.". Which isn't really a name, it's a full caption for the example. "Example 7-4. Loads not Reordered with Older Stores to the Same Location." is described as showing that store-forwarding is required (i.e. a core sees its own stores). – Peter Cordes Sep 10 '21 at 08:26
  • https://www.kernel.org/doc/Documentation/memory-barriers.txt also has some litmus tests / examples and doesn't name them at all. – Peter Cordes Sep 10 '21 at 08:28
  • https://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf - this one is written by the same bunch of people as yours, and the test is listed in the introduction. http://www0.cs.ucl.ac.uk/staff/j.alglave/papers/tacas11.pdf - and this one is by different people. – zanmato Sep 10 '21 at 11:22
  • Ok, good find, yeah they do label the test SB. (And say "Microarchitecturally, one can view this particular example as a visible consequence of store buffering".) A store buffer alone can create this effect in an otherwise sequential processor, so the name is somewhat apt, even if high-performance CPUs introduce other ways for it to happen. But that paper you linked does come awfully close to saying (or actually does say) something misleading. – Peter Cordes Sep 10 '21 at 11:43
  • Yeah, I don't think I'm entangled by those misleading statements anymore, esp. after all your nice and clear elaboration. It's been so helpful, and thank you again, Peter! – zanmato Sep 10 '21 at 15:31

1 Answer


It makes some sense to call StoreLoad reordering an effect of the store buffer because the way to prevent it is with mfence or a locked instruction that drains the store buffer before later loads are allowed to read from cache. Merely serializing execution (with lfence) would not be sufficient, because the store buffer still exists. Note that even sfence ; lfence isn't sufficient.
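
In C++ terms, a minimal sketch of the fix (my example, reusing x, y, r1 from the litmus test above; on x86, GCC/Clang typically compile this seq_cst fence to mfence or a locked RMW):

x.store(1, std::memory_order_relaxed);                // store enters the store buffer
std::atomic_thread_fence(std::memory_order_seq_cst);  // drains the store buffer (mfence on x86)
r1 = y.load(std::memory_order_relaxed);               // can't read until the store is globally visible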

Also I assume P5 Pentium (in-order dual-issue) has a store buffer, so SMP systems based on it could have this effect, in which case it would definitely be due to the store buffer. IDK how thoroughly the x86 memory model was documented in the early days before PPro even existed, but any naming of litmus tests done before that might well reflect in-order assumptions. (And naming after might include still-existing in-order systems.)


You can't tell which effect caused StoreLoad reordering. It's possible on a real x86 CPU (with a store buffer) for a later load to execute before the store has even written its address and data to the store buffer.

And yes, executing a store just means writing to the store buffer; it can't commit from the SB to L1d cache and become visible to other cores until after the store retires from the ROB (and thus is known to be non-speculative).

(Retirement happens in-order to support "precise exceptions". Otherwise, chaos ensues and discovering a mis-predict might mean rolling back the state of other cores, i.e. a design that's not sane. Can a speculatively executed CPU branch contain opcodes that access RAM? explains why a store buffer is necessary for OoO exec in general.)

I can't think of any detectable side-effect of the load uop executing before the store-data and/or store-address uops, or before the store retires, rather than after the store retires but before it commits to L1d cache.

You could force the latter case by putting an lfence between the store and the load, so the reordering is definitely caused by the store buffer. (A stronger barrier like mfence, a locked instruction, or a serializing instruction like cpuid, will all block the reordering entirely by draining the store buffer before the later load can execute. As an implementation detail, before it can even issue.)
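
Something like this sketch sets up that experiment (GNU C inline asm, x86 only, again reusing the variables from the litmus test):

x.store(1, std::memory_order_relaxed);   // store queued in the store buffer
asm volatile("lfence" ::: "memory");     // block OoO exec: the later load can't start early
r1 = y.load(std::memory_order_relaxed);  // any reordering still observed is the store buffer's doing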


A normal out-of-order exec design treats all instructions as speculative, only becoming non-speculative when they retire from the ROB, which is done in program order to support precise exceptions. (See Out-of-order execution vs. speculative execution for a more in-depth exploration of that idea, in the context of Intel's Meltdown vulnerability.)

A hypothetical design with OoO exec but no store buffer would be possible. It would perform terribly, with each store having to wait for all previous instructions to be definitively known to not fault or otherwise be mispredicted / mis-speculated before the store can be allowed to execute.

This is not quite the same thing as saying that they need to have already executed, though (e.g. just executing the store-address uop of an earlier store would be enough to know it's non-faulting, and for a load, doing the TLB/page-table checks will tell you it's non-faulting even if the data hasn't arrived yet). However, every branch instruction would need to be already executed (and known-correct), as would every ALU instruction like div that can fault.

Such a CPU also doesn't need to stop later loads from running before stores. A speculative load has no architectural effect / visibility, so it's ok if other cores see a share-request for a cache line which was the result of a mis-speculation. (On a memory region whose semantics allow that, such as normal WB write-back cacheable memory). That's why HW prefetching and speculative execution work in normal CPUs.

The memory model even allows StoreLoad reordering, so we're not speculating on memory ordering, only on the store (and other intervening instructions) not faulting. Which again is fine; speculative loads are always fine, it's speculative stores that we must not let other cores see. (So we can't do them at all if we don't have a store buffer or some other mechanism.)

(Fun fact: real x86 CPUs do speculate on memory ordering by doing loads out of order with each other, depending on addresses being ready or not, and on cache hit/miss. This can lead to memory order mis-speculation "machine clears" aka pipeline nukes (machine_clears.memory_ordering perf event) if another core wrote to a cache line between when it was actually read and the earliest the memory model said we could. Or even if we guess wrong about whether a load is going to reload something stored recently or not; memory disambiguation when addresses aren't ready yet involves dynamic prediction so you can provoke machine_clears.memory_ordering with single-threaded code.)
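
If you want to count those machine clears yourself, assuming Linux perf on an Intel CPU that exposes this event, something like:

perf stat -e machine_clears.memory_ordering ./a.out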

Out-of-order exec in P6 didn't introduce any new kinds of memory re-ordering, because that could have broken existing multi-threaded binaries. (At that time mostly just OS kernels, I'd guess!) That's why early loads have to be speculative if done at all. x86's main reason for existence is backwards compat; back then it wasn't the performance king.


Re: why this litmus test exists at all, if that's what you mean?
Obviously to highlight something that can happen on x86.

Is StoreLoad reordering important? Usually it's not a problem; acquire / release synchronization is sufficient for most inter-thread communication about a buffer being ready to read, or more generally for a lock-free queue. Or to implement mutexes. ISO C++ only guarantees that mutex lock / unlock are acquire and release operations, not seq_cst.
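
For example, a typical publish pattern needs only release/acquire; a sketch (my example; on x86 these compile to plain stores and loads, no store-buffer drain required):

int payload;                     // non-atomic data being published
std::atomic<bool> ready{false};

void producer() {
    payload = 42;
    ready.store(true, std::memory_order_release);  // plain mov on x86
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) {}  // plain mov on x86
    // payload is guaranteed to be 42 here, despite no mfence anywhere
}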

It's pretty rare that an algorithm depends on draining the store buffer before a later load.


Say I somehow observed this litmus test on an x86 machine,

Fully working program that verifies that this reordering is possible in real life on real x86 CPUs: https://preshing.com/20120515/memory-reordering-caught-in-the-act/. (The rest of Preshing's articles on memory ordering are also excellent. Great for getting a conceptual understanding of inter-thread communication via lockless operations.)
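
A stripped-down sketch of the same experiment (my own code, not Preshing's; he uses semaphores and random delays so the two threads overlap tightly, making the reordering far more frequent than this naive version):

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

int main() {
    int reorders = 0;
    for (int i = 0; i < 100000; ++i) {
        x.store(0); y.store(0);
        std::thread t1([] {
            x.store(1, std::memory_order_relaxed);
            r1 = y.load(std::memory_order_relaxed);
        });
        std::thread t2([] {
            y.store(1, std::memory_order_relaxed);
            r2 = x.load(std::memory_order_relaxed);
        });
        t1.join(); t2.join();
        if (r1 == 0 && r2 == 0) ++reorders;  // the StoreLoad-reordering outcome
    }
    std::printf("%d reorderings in 100000 iterations\n", reorders);
}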

Peter Cordes
  • Another great answer! This also provides a nice explanation for why x86 has the memory model that it does (all loads acquire, all stores release, but StoreLoad reordering still allowed): because that's the natural model of an in-order machine with a store buffer, which is what the P5 actually was. I never grasped that until now. – Nate Eldredge Sep 09 '21 at 14:54
  • This excellent answer resolves all my confusion. It's so helpful and I really appreciate that. – zanmato Sep 09 '21 at 18:41
  • BTW I took the program from https://preshing.com/20120515/memory-reordering-caught-in-the-act/ and put an `lfence` between the store and the load. I still observed reordering, but more rarely than with no fences. This becomes the true store buffer litmus test, I believe. Thank you so much, Peter! – zanmato Sep 09 '21 at 19:21
  • @zanmato: Cool, yeah. lfence is slow enough that, as well as blocking OoO exec, it also gives some time for the store to retire and commit, but on a cache miss the store will still have to wait long enough. – Peter Cordes Sep 09 '21 at 22:38
  • @NateEldredge: Note that an in-order pipeline can still scoreboard loads, only stalling if you actually try to read a load-result register that isn't ready yet. (So hit-under-miss can produce LoadLoad reordering.) Also, allowing the store buffer to commit out of order can produce StoreStore reordering (e.g. if the head of the buffer isn't exclusively owned, check older entries to see if they can commit). And LoadStore is possible with a cache-miss load and a store that commits quickly, if you don't forbid it. Typical in-order ARM CPUs like Cortex-A53 can do all of these things. – Peter Cordes Sep 09 '21 at 22:43
  • @NateEldredge: I guess P5 Pentium doesn't do those things, perhaps intentionally for compat with 486, and/or perhaps for LoadLoad just because x86 code often uses memory source operands for ALU instructions, so they do use the result right away. Only mov loads could be scoreboarded. (Also not much register space for software pipelining). Some of those x86 memory model decisions date back earlier. Mentioning compat with what P5 does was kind of a throwaway idea / random thought, but now that you mention it, it's an interesting question. – Peter Cordes Sep 09 '21 at 22:46
  • @NateEldredge: But yes I usually summarize x86-TSO as "program order + a store-buffer with store-forwarding". If 486 had a store buffer, that's presumably what it naturally was. – Peter Cordes Sep 10 '21 at 10:46