
ARM allows reordering loads with subsequent stores, so that the following pseudocode:

```
// CPU 0          |  // CPU 1
temp0 = x;        |  temp1 = y;
y = 1;            |  x = 1;
```

can result in `temp0 == temp1 == 1` (and this is observable in practice as well). I'm having trouble understanding how this occurs; it seems like in-order commit would prevent it (and my understanding is that in-order commit is present in pretty much all OOO processors). My reasoning goes: "the load must have its value before it commits, it commits before the store, and the store's value can't become visible to other processors until it commits."
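
To make the test concrete, here's how I'd write it in C++11 (a sketch with my own names, using relaxed atomics so the compiler itself adds no ordering):

```cpp
// Sketch of the same litmus test with C++11 relaxed atomics, so the
// compiler adds no barriers of its own; names are mine.
#include <atomic>

std::atomic<int> x{0}, y{0};
int temp0, temp1;

void cpu0_thread() {
    temp0 = x.load(std::memory_order_relaxed);  // load first...
    y.store(1, std::memory_order_relaxed);      // ...then store
}

void cpu1_thread() {
    temp1 = y.load(std::memory_order_relaxed);
    x.store(1, std::memory_order_relaxed);
}
// Run concurrently on an ARM core, temp0 == temp1 == 1 is a permitted
// outcome: each thread's store became globally visible before its own
// earlier load took a value (LoadStore reordering).
```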

I'm guessing that one of my assumptions must be wrong, and something like one of the following must hold:

  • Instructions don't need to commit all the way in-order. A later store could safely commit and become visible before an earlier load, so long as at the time the store commits the core can guarantee that the previous load (and all intermediate instructions) won't trigger an exception, and that the load's address is guaranteed to be distinct from the store's.

  • The load can commit before its value is known. I don't have a guess as to how this would be implemented.

  • Stores can become visible before they are committed. Maybe a memory buffer somewhere is allowed to forward stores to loads on a different thread, even if the load was enqueued earlier?

  • Something else entirely?

There are a lot of hypothetical microarchitectural features that would explain this behavior, but I'm most curious about the ones that are actually present in modern weakly ordered CPUs.

Poscopia
  • You mean in-order *retirement*, right? Leaving the out-of-order core, but for a store the data can still be in the store buffer, not yet *committed* to L1d cache. (My convention of using the word "commit" only for store-buffer -> L1d may not be standard, but I find it very helpful to use different terms for local completion (retire from the ROB) vs. global visibility (commit to L1d). It matches Intel's terminology for transactional-memory commit vs. instruction retirement, but a quick google shows some papers apparently / confusingly using "commit" for both.) – Peter Cordes Sep 07 '18 at 04:16
  • Yes, instruction retirement is what I'm thinking, thanks. (I think the ARM microarchitecture slides mostly call this commit as well, which may explain some of my terminology confusion). – Poscopia Sep 07 '18 at 04:31
  • One way it can happen is cross-logical-core store forwarding between sibling cores in an SMT design. Both threads do their store first, and then each forwards from the other's store, which is in the store buffer but unretired. I don't know if such forwarding is common in real designs though, because it would seem to tie the speculation of both threads together, which seems undesirable. There aren't many ARM SMT designs, so this probably doesn't explain your case. – BeeOnRope Sep 07 '18 at 04:43
  • @BeeOnRope: I think in-order cores can do it easily. And BTW, this is a great question. I hadn't really realized before that my mental model of OoO exec made LoadStore reordering impossible, for the reasons outlined. Of course there's always weird stuff like Alpha's dependent-load reordering on a few uarches. ([Dependent loads reordering in CPU](https://stackoverflow.com/q/35115634)) – Peter Cordes Sep 07 '18 at 04:54
  • The authors of [this paper](https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf) suggest that ARM can indeed commit stores out of order, before earlier loads have completed. See their claim and tests in section 7.1. Seems weird though! – BeeOnRope Sep 07 '18 at 04:59

1 Answer


Your bullet points of assumptions all look correct to me, except that you could build a uarch where loads can retire from the OoO core after merely checking permissions (TLB) to make sure the load can definitely happen. There could be OoO exec CPUs that do that (update: apparently there are).

I think x86 CPUs require loads to actually have the data arrive before they can retire, but their strong memory model doesn't allow LoadStore reordering anyway. So ARM certainly could be different.

You're right that stores can't be made visible to any other cores before retirement. That way lies madness. Even on an SMT core (multiple logical threads on one physical core), it would link speculation on two logical threads together, requiring them both to roll back if either one detected mis-speculation. That would defeat the purpose of SMT of having one logical thread take advantage of stalls in others.

(Related: Making retired but not yet committed (to L1d) stores visible to other logical threads on the same core is how some real PowerPC implementations make it possible for threads to disagree on the global order of stores. See *Will two atomic writes to different locations in different threads always be seen in the same order by other threads?*)
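
For reference, a rough sketch of the shape of the IRIW-style test that linked question is about (function names are mine). With acquire/release, the C++ model lets the two readers disagree about which store happened first; making all of the operations seq_cst forbids that:

```cpp
// Hypothetical IRIW (independent reads of independent writes) sketch.
#include <atomic>

std::atomic<int> x{0}, y{0};

void writer_x() { x.store(1, std::memory_order_release); }
void writer_y() { y.store(1, std::memory_order_release); }

void reader_1(int &r1, int &r2) {   // may observe x==1 then y==0 ...
    r1 = x.load(std::memory_order_acquire);
    r2 = y.load(std::memory_order_acquire);
}

void reader_2(int &r3, int &r4) {   // ... while this one observes y==1 then x==0
    r3 = y.load(std::memory_order_acquire);
    r4 = x.load(std::memory_order_acquire);
}
// r1==1, r2==0, r3==1, r4==0 together means the two readers saw the
// two independent stores in opposite orders.
```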


CPUs with in-order execution can start a load (check the TLB and write a load-buffer entry) and only stall if an instruction tries to use the result before it's ready. Then later instructions, including stores, can run normally. This is basically required for non-terrible performance in an in-order pipeline; stalling on every cache miss (or even just L1d latency) would be unacceptable. Memory parallelism is a thing even on in-order CPUs; they can have multiple load buffers that track multiple outstanding cache misses. High(ish) performance in-order ARM cores like Cortex-A53 are still widely used in modern smartphones, and scheduling loads well ahead of when the result register is used is a well-known important optimization for looping over an array. (Unrolling or even software pipelining.)
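
For example, a hand-scheduled loop in the spirit of that optimization might look like the following (a sketch; the function and names are mine, and compilers or hand-written asm normally do this scheduling themselves):

```cpp
// Software-pipelined array sum for an in-order core: issue the load
// for the next element well before its value is needed, so a cache
// miss overlaps with the add instead of stalling at the load-use.
#include <cstddef>

long sum_pipelined(const int *a, std::size_t n) {
    if (n == 0) return 0;
    long sum = 0;
    int cur = a[0];                   // first load issued up front
    for (std::size_t i = 1; i < n; ++i) {
        int next = a[i];              // load for the *next* iteration starts now
        sum += cur;                   // meanwhile, consume the previous load
        cur = next;
    }
    return sum + cur;                 // last element
}
```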

So if the load misses in cache but the store hits (and commits to L1d before earlier cache-miss loads get their data), you can get LoadStore reordering. (Jeff Preshing's intro to memory reordering uses that example for LoadStore, but doesn't get into uarch details at all.)

A load can't fault after you've checked the TLB and / or whatever memory-region stuff for it. That part has to be complete before it retires, or before it reaches the end of an in-order pipeline. Just like a retired store sitting in the store buffer waiting to commit, a retired load sitting in a load buffer is definitely happening at some point.

So the sequence on an in-order pipeline is:

  • `lw r0, [r1]`: TLB hit, but misses in L1d cache. The load execution unit writes the address (`r1`) into a load buffer. Any later instruction that tries to read `r0` will stall, but we know for sure that the load didn't fault.

    With `r0` tied to waiting for that load buffer to be ready, the `lw` instruction itself can leave the pipeline (retire), and so can later instructions.

  • Any number of other instructions that don't read `r0` can execute. (An instruction that did try to read `r0` would stall an in-order pipeline until the data arrives.)

  • `sw r2, [r3]`: the store execution unit writes address + data to the store buffer / queue. Then this instruction can retire.

    Probing the load buffers finds that this store doesn't overlap with the pending load, so it can commit to L1d. (If it had overlapped, you couldn't commit it until a MESI RFO completed anyway, and fast restart would forward the incoming data to the load buffer. So it might not be too complicated to handle that case without even probing on every store, but let's only look at the separate-cache-line case where we can get LoadStore reordering.)

    Committing to L1d = becoming globally visible. This can happen while the earlier load is still waiting for the cache line to arrive.
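
For contrast, here's a sketch of how portable code forbids that outcome (my example, under the usual compiler mapping): an acquire load keeps the later store ordered after it, so neither thread's store can commit before its own load. On AArch64 these typically compile to LDAR / STLR; on ARMv7, to `dmb`-based sequences.

```cpp
// Hypothetical variant of the litmus test that forbids
// temp0 == temp1 == 1: each acquire load orders the later store
// after it, and each release store pairs with the other thread's
// acquire load.
#include <atomic>

std::atomic<int> x{0}, y{0};
int temp0, temp1;

void cpu0_thread() {
    temp0 = x.load(std::memory_order_acquire);  // later store can't pass this
    y.store(1, std::memory_order_release);
}

void cpu1_thread() {
    temp1 = y.load(std::memory_order_acquire);
    x.store(1, std::memory_order_release);
}
```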


For OoO CPUs, you'd need some way to tie load completion back into the OoO core for instructions waiting on the load result. I guess that's possible, but it means that the architectural/retirement value of a register might not be stored anywhere in the core. Pipeline flushes and other rollbacks from mis-speculation would have to hang on to that association between an incoming load and a physical and architectural register. (Not flushing store buffers on pipeline rollbacks is already a thing that CPUs have to do, though. Retired but not yet committed stores sitting in the store buffer have no way to be rolled back.)

That could be a good design idea for uarches with a small OoO window that's too small to come close to hiding a cache miss. (Which, to be fair, is every high-performance OoO exec CPU: memory latency is usually too high to fully hide.)


We have experimental evidence of LoadStore reordering on an OoO ARM: section 7.1 of https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf shows non-zero counts for "load buffering" on Tegra 2, which is based on the out-of-order Cortex-A9 uarch. I didn't look up all the others, but I did rewrite the answer to suggest that this is the likely mechanism for out-of-order CPUs, too. I don't know for sure if that's the case, though.
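
If you want to look for it yourself, a crude harness along these lines can show non-zero counts on such hardware (my code, not from the paper; real litmus tools pin threads to cores and reuse them instead of spawning per iteration, which is slow and makes the reordering rarer):

```cpp
// Crude, hypothetical load-buffering harness: run the litmus test many
// times and count how often temp0 == temp1 == 1 is observed.
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int temp0, temp1;

int main() {
    const long iters = 100000;
    long observed = 0;
    for (long i = 0; i < iters; ++i) {
        x.store(0, std::memory_order_relaxed);
        y.store(0, std::memory_order_relaxed);
        std::atomic<int> ready{0};                 // crude start barrier

        std::thread t0([&] {
            ready.fetch_add(1);
            while (ready.load() < 2) {}            // spin until both are ready
            temp0 = x.load(std::memory_order_relaxed);
            y.store(1, std::memory_order_relaxed);
        });
        std::thread t1([&] {
            ready.fetch_add(1);
            while (ready.load() < 2) {}
            temp1 = y.load(std::memory_order_relaxed);
            x.store(1, std::memory_order_relaxed);
        });
        t0.join();
        t1.join();
        if (temp0 == 1 && temp1 == 1) ++observed;  // the LoadStore outcome
    }
    std::printf("temp0 == temp1 == 1 observed %ld / %ld times\n",
                observed, iters);
}
```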

Peter Cordes
  • What if the load or any other instruction older than the store faults though? Then the store has incorrectly been made visible to other threads. Or do these archs not have precise faults? – BeeOnRope Sep 07 '18 at 05:02
  • @BeeOnRope: A load can't fault after you've checked the TLB and / or whatever memory-region stuff for it. That part has to be complete before it retires, or before it reaches the end of an in-order pipeline. Just like a retired store sitting in the store buffer waiting to commit, a retired load sitting in a load buffer is definitely happening at some point. – Peter Cordes Sep 07 '18 at 05:08
  • I see, so the load and store have both actually retired. – BeeOnRope Sep 07 '18 at 05:10
  • @BeeOnRope: updated to put more of what was in my head into text. You probably weren't the only person who didn't grok my shorter explanation. – Peter Cordes Sep 07 '18 at 05:40
  • Nice to know the uarch details about this load/store reordering. I wonder if this mechanism, i.e., a load can retire immediately after it checks the TLB and turns out to be not faulting, can cause load/load reordering as well? Say a prior non-faulting load takes a cache miss (hence waits in the load buffer), and a later non-faulting load is a cache hit (hence all subsequent instructions can proceed). – zanmato Dec 14 '21 at 16:47
  • @zanmato: LoadLoad reordering is already possible without this, just via OoO exec, e.g. a cache-hit load can take its value while an older load is still waiting for data to arrive. (Or an in-order CPU with hit-under-miss capability.) But yes, letting non-faulting loads retire while they're still waiting for data gives even more time for later loads to arrive ahead of them, whether that's by hitting in some closer level of cache or just not being delayed as much by contention waiting for another core to share the line. – Peter Cordes Dec 14 '21 at 20:09
  • @PeterCordes But how do other OoO architectures, like x86, prohibit load/load reordering? I guess the reason is the different timing of load retiring (i.e., when data actually arrives). So OoO itself doesn't necessarily mean load/load reordering is possible? – zanmato Dec 22 '21 at 10:59
  • @zanmato: x86 (Intel at least, presumably AMD) does do *speculative* LoadLoad reordering, and confirms on retirement(?) that the cache line hasn't been invalidated, so it's allowed to pretend that the load architecturally happened now and got the same value. Prohibiting LoadLoad reordering is one reason why reading shared data sometimes results in `machine_clears.memory_ordering` events. See [What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?](https://stackoverflow.com/q/45602699) – Peter Cordes Dec 22 '21 at 20:53
  • @PeterCordes So ARMv7/Power doesn't "confirm" on retirement, thus exhibits load/load reordering and differs from x86? – zanmato Dec 23 '21 at 04:59
  • @zanmato: right, of course. The load is architecturally allowed to happen early, so there's no need to keep tracking its memory order after it produces a value. (Unless an ARMv8 design tries to do anything speculatively around LDAR acquire loads, but probably not a good idea because that's a good signal that other cores probably *are* accessing this and other cache lines.) – Peter Cordes Dec 23 '21 at 05:05
  • Much appreciated, Peter! – zanmato Dec 23 '21 at 05:39
  • @PeterCordes an obvious question following the discussion - how do these weak uarchs recover correct LoadStore ordering (say as part of ARM `dmb`)? Do they stall waiting for the load buffers to flush, similar to a store buffer flush? I guess not really, as it needs to be a relatively lightweight operation. – Daniel Nitzan Mar 26 '22 at 21:51
  • @DanielNitzan: The more order-tracking they do at all times, the cheaper a barrier instruction can be. If the CPU can't track the relative program-order of various in-flight loads / stores and stop them from happening in the wrong order, it would I guess have to wait for all earlier in-flight loads to retire before letting any later stores retire (and be eligible to commit). (AArch64 I think has some fine-grained `dmb` barriers, but on ARM32 I've only ever seen GCC / clang use `dmb ish` full barriers. I assume we're talking about just a LoadStore barrier insn, even if hypothetically.) – Peter Cordes Mar 26 '22 at 21:58
  • @PeterCordes "With r0 tied to waiting for that load buffer to be ready, the lw instruction itself can leave the pipeline (retire), and so can later instructions." So the load instruction can retire before the desired value are load into load buffer? This is very interesting. Where can I find some papers or documents about this process, or this is just your reasoning. – haolee May 09 '22 at 14:41
  • @haolee: It's my reasoning, but I think it's necessary to explain the fact that LoadStore reordering is possible on real-world ARM CPUs. Store commit is definitely after store retirement, so for an older load to not take a value until after that means it's definitely taking a value from coherent cache *somehow*, after the load retired. In-order retirement is also a known fact. – Peter Cordes May 09 '22 at 14:46
  • @PeterCordes It sounds reasonable. These days I've skimmed through almost every document I could find on Google, but can't find any guides or sheets about the relationship between load buffers and retirement. It's a pity that CPU manufacturers don't share these details. – haolee May 09 '22 at 15:46