
When reading about consistency models (namely x86's TSO), authors generally describe machines as a set of CPUs, each with an associated store buffer and a private cache.

If my understanding is correct, store buffers can be described as queues where CPUs may put any store instruction they want to commit to memory. So as the name states, they are store buffers.
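That FIFO-queue mental model can be sketched as a toy simulation (not real hardware; the `StoreBuffer` class and its method names are made up for illustration):

```python
from collections import deque

class StoreBuffer:
    """Toy model of a store buffer: a FIFO of pending (address, value)
    stores that drain to shared memory in program order."""
    def __init__(self, memory):
        self.memory = memory      # shared dict: address -> value
        self.pending = deque()    # oldest store at the left

    def store(self, addr, value):
        # The CPU puts the store into the buffer; memory is unchanged yet.
        self.pending.append((addr, value))

    def drain_one(self):
        # Commit the oldest store to memory. Draining in FIFO order is
        # what gives TSO its "no StoreStore reordering" property.
        addr, value = self.pending.popleft()
        self.memory[addr] = value

memory = {"x": 0}
sb = StoreBuffer(memory)
sb.store("x", 1)
print(memory["x"])   # 0 -- the store is buffered, not yet globally visible
sb.drain_one()
print(memory["x"])   # 1 -- now committed to memory
```

Note the queue holds only stores; loads never enter it, which is exactly what the question below is about.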

But when I read those papers, they tend to talk about the interaction of loads and stores, with statements such as "a later load can pass an earlier store", which is slightly confusing: they almost seem to be talking as if the store buffer held both loads and stores, when it doesn't -- right?

So there must also be a load buffer that they are not (at least explicitly) talking about. Plus, those two must somehow be synchronized, so both know when it's acceptable to load from memory and to commit to memory -- or am I missing something?

Can anyone shed some more light on this?

EDIT:

Let's look at a paragraph out of "A primer on memory consistency and cache coherence":

To understand the implementation of atomic RMWs in TSO, we consider the RMW as a load immediately followed by a store. The load part of the RMW cannot pass earlier loads due to TSO’s ordering rules. It might at first appear that the load part of the RMW could pass earlier stores in the write buffer, but this is not legal. If the load part of the RMW passes an earlier store, then the store part of the RMW would also have to pass the earlier store because the RMW is an atomic pair. But because stores are not allowed to pass each other in TSO, the load part of the RMW cannot pass an earlier store either.

more specifically,

The load part of the RMW cannot pass earlier loads due to TSO’s ordering rules. It might at first appear that the load part of the RMW could pass earlier stores in the write buffer

so they are referring to loads / stores crossing each other in the write buffer (which I assume is the same thing as the store buffer?)

Thanks

devoured elysium
  • Load buffers don't cause reordering. They wait for data that hasn't arrived yet; the load finishes "executing" when it reads data. Store buffers are fundamentally different; they hold data for some time before it becomes globally visible. – Peter Cordes May 09 '19 at 00:03
  • @PeterCordes hi! please check my edit. – devoured elysium May 09 '19 at 00:09

1 Answer


Yes, write buffer = store buffer.

They're talking about what would happen if an atomic RMW were split into a separate load and store, and the store buffer delayed another store (to a different address) so that it became globally visible after the RMW's load but still before the RMW's store.

Obviously that would make it non-atomic, and violate the requirement that all x86 atomic RMW operations are also full barriers. (The lock prefix implies that, too.)

Normally it would be hard for a reader to detect that, but if the "separate address" was contiguous with the atomic RMW, then e.g. a dword store + a dword RMW could be observed by another thread doing a 64-bit qword load of both as one atomic operation.
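That observation can be made concrete with a small enumeration (a hypothetical sketch, not real hardware: `lo` models the earlier dword store, `hi` the store half of a dword RMW that increments the adjacent dword, and each element of `seen` is a snapshot a 64-bit reader could observe):

```python
def observations(commit_order):
    """Enumerate the (lo, hi) pairs a 64-bit qword reader could see if the
    two adjacent dword stores become visible in the given order."""
    lo, hi = 0, 0
    seen = [(lo, hi)]
    for which in commit_order:
        if which == "lo":
            lo = 1          # the earlier plain dword store
        else:
            hi += 1         # the store half of the dword RMW
        seen.append((lo, hi))
    return seen

tso = observations(["lo", "hi"])        # stores commit in program order
reordered = observations(["hi", "lo"])  # RMW's store passed the earlier store
print(tso)        # [(0, 0), (1, 0), (1, 1)]
print(reordered)  # [(0, 0), (0, 1), (1, 1)] -- (0, 1) is forbidden by TSO
```

The `(0, 1)` snapshot is exactly the state a wide atomic load could catch if stores were allowed to pass each other; under TSO it can never appear.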


re: the title question:

Load buffers don't cause reordering. They wait for data that hasn't arrived yet; the load finishes "executing" when it reads data.

Store buffers are fundamentally different; they hold data for some time before it becomes globally visible.

x86's TSO memory model can be described as sequential consistency + a store buffer (with store-forwarding). See also x86 mfence and C++ memory barrier and the comments on that answer for more discussion of why merely allowing StoreLoad reordering is not a sufficient description: a thread can reload data that it just stored, and if a load partially overlaps with recent stores, the HW merges data from the store buffer with data from L1d to complete the load before the store is globally visible.
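The "SC + store buffer with store-forwarding" description can be sketched as a toy two-core model (the `Core` class and method names are invented for illustration; real forwarding works on byte ranges, not whole addresses):

```python
from collections import deque

class Core:
    """Toy TSO core: sequentially consistent shared memory plus a private
    FIFO store buffer with store-to-load forwarding."""
    def __init__(self, memory):
        self.memory = memory
        self.sb = deque()   # pending (addr, value), oldest first

    def store(self, addr, value):
        self.sb.append((addr, value))

    def load(self, addr):
        # Store forwarding: search our own buffer newest-first, so a core
        # always sees its own latest store before it is globally visible...
        for a, v in reversed(self.sb):
            if a == addr:
                return v
        # ...and falls back to (globally visible) memory otherwise.
        return self.memory[addr]

    def drain(self):
        # Commit pending stores to memory in FIFO (program) order.
        while self.sb:
            a, v = self.sb.popleft()
            self.memory[a] = v

memory = {"x": 0}
writer, reader = Core(memory), Core(memory)
writer.store("x", 1)
print(writer.load("x"))  # 1 -- forwarded from its own store buffer
print(reader.load("x"))  # 0 -- the store is not globally visible yet
writer.drain()
print(reader.load("x"))  # 1 -- now committed
```

This asymmetry (the writer sees its own store early) is precisely why "SC + store buffer" is a stronger statement than "SC + StoreLoad reordering allowed".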

Also note that x86 CPUs speculatively do reorder loads (at least Intel's do), but shoot down the mis-speculation to preserve the TSO memory model of no LoadLoad or LoadStore reordering. CPUs thus have to track loads vs. store ordering. Intel calls the combined store+load buffer tracking structure the "memory order buffer" (MOB). See Size of store buffers on Intel hardware? What exactly is a store buffer? for more.
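The squash-on-detection idea can be reduced to a two-load decision table (a simplified sketch of the principle only, not Intel's actual MOB logic):

```python
def outcome(older_executed_first, younger_line_snooped):
    """For two loads, decide whether speculatively executing the younger
    load early stays invisible or forces a memory-order machine clear."""
    # Loads executed in program order: nothing to hide.
    if older_executed_first:
        return "ok"
    # Younger load ran early, but nobody wrote its cache line before the
    # older load executed: the reordering is architecturally invisible.
    if not younger_line_snooped:
        return "ok"
    # A snoop (another core's store) hit the speculatively-early load's
    # line: its value may be stale, so squash and re-execute.
    return "machine_clear"

print(outcome(True, True))    # ok
print(outcome(False, False))  # ok -- speculation paid off
print(outcome(False, True))   # machine_clear
```

In the common case no other core touches the line, so the speculation is free; the machine clear is the rare penalty that preserves the illusion of no LoadLoad reordering.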

Peter Cordes
  • "x86's TSO memory model can be described as sequential-consistency + a store-buffer (with store-forwarding)" Let's assume for a second that Intel CPUs would not have a store-buffer (so they'd block when writing to the cache). Would then multithreaded applications run on Intel CPUs be sequentially consistent? My understand is that the answer is no, as the CPU itself may run instructions OoO (or does it only do it in ways such that the effective load/stores always follow program order?) – devoured elysium May 09 '19 at 01:59
  • @devouredelysium: Out-of-order execution preserves as much of the illusion of loads/stores running in program order as the memory model requires. There's a whole paragraph in my answer about how loads are *speculatively* executed out of order, but with a memory-order mis-speculation pipeline rollback to retirement state if it detects that the difference mattered. So yes, with no store buffer it would be sequentially consistent, and performance would be absolute garbage because stores couldn't execute until they were ready to retire. A store buffer is *essential* to decouple exec from commit. – Peter Cordes May 09 '19 at 02:15
  • And to allow *speculative* execution of stores, before you find out if a previous instruction took an exception, or if an earlier branch was actually mis-predicted. You can't roll back after you commit a store into L1d cache: at that point it's globally visible and other threads could already be using the data. So store commit to L1d can't be done speculatively, and has to wait until the store is known to be correct. – Peter Cordes May 09 '19 at 02:17
  • Argh, I really need to re-read some basic computer architecture stuff.. – devoured elysium May 09 '19 at 03:45
  • @devouredelysium: my answer on [Size of store buffers on Intel hardware? What exactly is a store buffer?](//stackoverflow.com/q/54876208) has some useful links, and tries to present some basic concepts in the answer itself. The most important thing to remember with OoO exec is that you can speculate all you want inside one core, but anything that can become part of what other cores are doing must be solid or else you could need to roll back multiple cores. (And then complexity spirals out of control; nobody does that.) OoO exec treats *everything* as speculative until retirement. – Peter Cordes May 09 '19 at 03:49
  • I have been digging into how X86 prevents certain reorderings. Prevention of StoreStore and LoadLoad reordering is clear; but how is LoadStore reordering prevented? – pveentjer May 06 '20 at 06:57
  • @pveentjer: That happens for free because stores can't commit from the store buffer to L1d cache until after the store instruction retires from the out-of-order back-end (known to be non-speculative). And load instructions can't retire until they've taken a value from cache (executed). Some weakly-ordered ISAs can allow load instructions to retire after checking the load won't fault, but before the load data actually arrives, like how in-order ISAs can scoreboard loads to avoid stalling until you actually try to read the result. I assume x86 microarchitectures simply don't do that. – Peter Cordes May 06 '20 at 07:01
  • @pveentjer: preventing LoadLoad reordering is actually the hard one; high-performance x86 CPUs use a Memory Order Buffer to track load and store ordering, and do a memory-ordering machine clear if speculative early loads violated the architectural load-ordering rules. – Peter Cordes May 06 '20 at 07:05
  • @PeterCordes Yes. I have been studying this part the last few days. Went through many answers from you, BeeOnRope, margeret etc. Extremely helpful. Only the LoadStore reordering remained unclear. – pveentjer May 06 '20 at 07:09
  • @pveentjer: see also [How is load->store reordering possible with in-order commit?](https://stackoverflow.com/q/52215031) - I thought I remembered writing an answer about in-order retirement preventing LoadStore for free, depending on when you let a load retire. :P – Peter Cordes May 06 '20 at 07:20
  • If my understanding is correct: when loads and stores are retired, they are retired in program order (the order in the ROB). So it isn't that loads in the LB retire independently of stores in the SB (there is a total order.. not a partial order). I'm pretty sure this is the case but I want to make sure my understanding is correct. – pveentjer May 07 '20 at 03:19
  • @pveentjer: LB entries and SB entries don't retire on their own. Load and store *uops* retire from the ROB (in program order). Loads free the associated LB entry at that point. Because AFAIK a load has to be fully completed for it to be considered to have executed successfully on x86 uarches, unlike some ARM(?). But stores are fundamentally different: they have to wait until they're non-speculative to make themselves visible to other cores, so SB entries *can't* commit to L1d cache (and be freed) until *after* the store uop retires from the ROB. (aka a "graduated" store). – Peter Cordes May 07 '20 at 06:20
  • Thanks. Then my understanding is correct and I understand why the LoadStore reordering is not an issue. – pveentjer May 07 '20 at 07:14
  • @PeterCordes are there any other examples of violating atomic behavior on a RMW apart from the dword store + a dword RMW example you gave above? – pveentjer May 03 '21 at 14:33
  • @pveentjer: Literally any two separate operations that you do separately do *not* form an atomic RMW. That's the whole point of that initial section of the answer. e.g. `a = 1; b = 2;` can be observed to have partially happened (by an observer using a wide atomic load that reads both a and b together.) Or a store to `a` and a load of `b`, or any two unrelated things aren't an atomic transaction (unless you use TSX to make them part of a transaction). – Peter Cordes May 03 '21 at 14:38
  • I don't think my question was very clear. Let me explain what I'm failing to understand. So imagine there is a RMW on A. And in the SB there is already a store of B (different cacheline). For the atomic RMW it is required to wait for the SB to be drained before starting with the RMW. One advantage is that when the write in the RMW executes, it can write directly to the L1D since the cacheline is already in the right state, and since the SB is drained it won't lead to the stores being reordered (otherwise the store of B could become globally visible after the store of A). More coming... – pveentjer May 03 '21 at 14:47
  • For discussion sake: what if the store from the RMW would write to the SB and we would just wait for the SB to be drained after the RMW 'completes'. What is the purpose of the waiting for the SB to be drained before the RMW begins. I understand in this case that the store of B could be reordered with the load of RMW of A. But how bad is this? It isn't atomic, but is there an example that demonstrates it isn't atomic? – pveentjer May 03 '21 at 14:48
  • @pveentjer: x86 drains the SB before atomic RMWs in order to enforce store ordering. A weakly-ordered ISA could just bypass the store buffer and atomic-RMW L1d cache directly (after snooping the SB for pending stores that overlap), if the RMW instruction didn't have a release or seq-cst ordering. – Peter Cordes May 03 '21 at 15:05
  • @pveentjer: If you just put a store into the SB, this core could lose ownership of the line and let another core read and write before we regain ownership and commit the store. So e.g. `a.fetch_add(1)` could lose counts if multiple threads were doing it. Or the non-atomic RMW could step on a plain store, so the final value visible when the dust settles is `old+1`, which should be impossible after both `a++` and `a=1234` execute, assuming old isn't 1233. [Can num++ be atomic for 'int num'?](https://stackoverflow.com/q/39393850) – Peter Cordes May 03 '21 at 15:07
  • 'x86 drains the SB before atomic RMWs in order to enforce store ordering'. That is what the store buffer on x86 already does. We don't need to drain the store buffer for that. – pveentjer May 03 '21 at 15:35
  • This post addresses my question. For the 'atomic' behavior of an atomic rmw, the full fence isn't needed. It is required for the ordering constraints that are also part of the lock requirements. You already gave very useful answers to that post @PeterCordes; so thanks for that. https://stackoverflow.com/questions/60332591/why-is-lock-a-full-barrier-on-x86 – pveentjer May 04 '21 at 05:40
  • @pveentjer: re: store ordering: An atomic RMW has to read and write L1d at (logically) the same time. For that to happen, and for the store part of the RMW to obey ordering rules, the SB must be drained first. (x86 atomic RMWs (`lock` prefix) are also full barriers, and drain the SB *before* the load operation. This allows sequential consistency, by letting you delay the load until other cores are seeing what your load is seeing. Before SSE2 `mfence`, seq_cst might not have been possible without atomic RMWs draining the SB before they operate. Probably what Hadi said on the linked answer) – Peter Cordes May 04 '21 at 05:45