
I thought I understood how an L1d write miss is handled, but thinking carefully about it confused me.

Here is an assembly language fragment:

;rdi contains some valid 64-byte aligned pointer
;rsi contains some data
mov [rdi], rsi
mov [rdi + 0x40], rsi        
mov [rdi + 0x20], rsi

Assume that the lines containing [rdi] and [rdi + 0x40] are not in the Exclusive or Modified state in L1d. Then I can imagine the following sequence of actions:

  1. mov [rdi], rsi retires.
  2. mov [rdi], rsi tries to write its data into L1d. An RFO is initiated and the data is placed into a WC buffer.
  3. mov [rdi + 0x40], rsi retires (mov [rdi], rsi has already retired, so this is possible).
  4. mov [rdi + 0x40], rsi initiates an RFO for the next cache line, and its data is placed into a WC buffer.
  5. mov [rdi + 0x20], rsi retires (mov [rdi + 0x40], rsi has already retired, so this is possible).
  6. mov [rdi + 0x20], rsi notices that an RFO for [rdi] is already in progress. Its data is placed into the WC buffer.
  7. BOOM! The [rdi] RFO happens to finish before the [rdi + 0x40] RFO, so the data of mov [rdi], rsi and mov [rdi + 0x20], rsi can now be committed to the cache. This breaks memory ordering.

How is such a case handled to maintain correct memory ordering?

Some Name
  • Maybe I'm missing something, but... how does this break memory ordering? In C, with the variables `int64_t rdi[16]` and `int64_t rsi` the code might look like this: `rdi[0] = rsi; rdi[8] = rsi; rdi[4] = rsi;`. But the compiler and the CPU are allowed to reorder these instructions as they see fit, aren't they? – jcsahnwaldt Reinstate Monica Aug 22 '23 at 20:04
  • Never mind... I've since read https://preshing.com/20120930/weak-vs-strong-memory-models/ (found in a comment by Peter Cordes on https://stackoverflow.com/questions/67595683/) and learned that my mental memory model was based on C and Java, but x86's model is much stronger. – jcsahnwaldt Reinstate Monica Aug 22 '23 at 21:47

1 Answer


Starting an RFO can be separate from placing the store data into an LFB; e.g. starting RFOs early for entries that aren't yet at the head of the store buffer can allow memory-level parallelism for stores. What you've proved is that for that to happen, store data can't always move into an LFB (Line Fill Buffer, also used for NT / WC stores).

If an RFO could only happen by moving store data from the store buffer (SB) into an LFB, then yes, you could only RFO for the head of the SB, not in parallel for any graduated entry. (A "graduated" store is one whose uops have retired from the ROB, i.e. become non-speculative). But if you don't have that requirement, you could RFO even earlier, even speculatively, but you probably wouldn't want to.1
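One rough way to see this memory-level parallelism from software: time a burst of cache-miss stores, one per cache line, over a buffer much larger than cache. If an RFO could only start when a store reached the SB head, each store would cost close to a full miss latency. This is only a sketch of such a test (my own, not a tuned benchmark; the buffer size and timing method are arbitrary choices):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define LINE   64
#define NLINES (1u << 20)            /* 64 MiB total: far bigger than L1d/L2, so stores miss */

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    char *buf = aligned_alloc(LINE, (size_t)NLINES * LINE);
    if (!buf) return 1;
    memset(buf, 0, (size_t)NLINES * LINE);          /* fault the pages in first */

    volatile uint64_t *p = (volatile uint64_t *)buf;
    size_t step = LINE / sizeof(uint64_t);          /* one qword store per cache line */

    double t0 = now_sec();
    for (size_t i = 0; i < (size_t)NLINES * step; i += step)
        p[i] = i;                                   /* each store needs its own RFO */
    double t1 = now_sec();

    printf("%.2f ns per cache-miss store\n", (t1 - t0) * 1e9 / NLINES);
    return 0;
}
```

If you run something like this, the per-store cost typically comes out well below a full DRAM round trip, consistent with multiple RFOs being in flight at once.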

(Given @BeeOnRope's findings about how multiple cache-miss stores to the same line can commit into an LFB, and then another LFB for another line, this might be the mechanism for having multiple RFOs in flight, not just the SB head. We'd have to check if an ABA store pattern limited memory-level parallelism. If that's the case, then maybe starting an RFO is the same as moving the data from the SB to an LFB, freeing that SB entry. But note that the new head of the SB still couldn't commit until those pending RFOs complete and commit the stores from the LFBs.)


A simple mental model that's pretty close to reality

On a store miss, the store buffer entry holds the store data until the RFO is complete, and commits straight into L1d (flipping the line from Exclusive to Modified state). Strong ordering is ensured by in-order commit from the head of the store buffer2.
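As a toy illustration of that simple model (my own sketch in C, with made-up names and sizes, not how the hardware is actually organized): the store buffer is a ring buffer, and only the head entry is a candidate to commit, only after it has graduated and only once L1d owns its line.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SB_SIZE 56                 /* arbitrary; roughly Skylake-ish */

struct sb_entry {
    uint64_t addr, data;
    bool     graduated;            /* store uop has retired from the ROB */
};

struct store_buffer {
    struct sb_entry e[SB_SIZE];    /* ring buffer: alloc at tail, commit at head */
    size_t head, tail;
};

/* Stand-ins for L1d behaviour so the sketch compiles; not real APIs. */
static bool l1d_owns_line(uint64_t addr) { (void)addr; return true; }            /* line in E or M? */
static void l1d_write(uint64_t addr, uint64_t data) { (void)addr; (void)data; }  /* flips E -> M    */

/* "Each cycle": only the head may commit, so stores become globally
 * visible strictly in program order.  A cache-miss store at the head
 * stalls commit until its RFO completes; later entries just wait.     */
void try_commit(struct store_buffer *sb)
{
    if (sb->head == sb->tail)        return;    /* store buffer empty */
    struct sb_entry *h = &sb->e[sb->head];
    if (!h->graduated)               return;    /* still speculative  */
    if (!l1d_owns_line(h->addr))     return;    /* RFO not done yet   */
    l1d_write(h->addr, h->data);
    sb->head = (sb->head + 1) % SB_SIZE;
}
```

In this picture the question's example can never break ordering: the [rdi + 0x20] store simply sits behind the [rdi + 0x40] store in the buffer and can't commit until that one has committed.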

As @HadiBrais wrote in answer to Where is the Write-Combining Buffer located? x86

My understanding is that for cacheable stores, only the RFO request is held in the LFB, but the data to be stored waits in the store buffer until the target line is fetched into the LFB entry allocated for it. This is supported by the following statement from Section 2.4.5.2 of the Intel optimization manual:

The L1 DCache can maintain up to 64 load micro-ops from allocation until retirement. It can maintain up to 36 store operations from allocation until the store value is committed to the cache, or written to the line fill buffers (LFB) in the case of non-temporal stores.

This is pretty much fine for thinking about performance tuning, but probably not for MDS vulnerabilities that can speculatively use stale data that faulting loads read from an LFB or whatever.

Any store coalescing or other tricks must necessarily respect the memory model.


But is it that simple? No

We know CPUs can't violate their memory model, and that speculation + roll back isn't an option for commit to globally-visible state like L1d, or for graduated stores in general because the uops are gone from the ROB. They've already happened as far as local OoO exec is concerned, it's just a matter of when they'll become visible to other cores. Also we know that LFBs themselves are not globally visible. (There's some indication that LFBs are snooped by loads from this core, like the store buffer, but as far as MESI states they're more like an extension of the store buffer.)

@BeeOnRope has done some more experiments, finding some evidence that a series of stores like AAABBCCCC can drain into three LFBs, for lines A, B, C. RWT thread with an experiment that demonstrates a 4x perf difference predicted by this theory.

This implies that the CPU can track order between LFBs, although still not within a single LFB of course. A sequence like AAABBCCCCA (or ABA) would not be able to commit past the final A store because the "current head" LFB is for line C, and there's already an LFB waiting for line A to arrive. A 4th line (D) would be ok, opening a new LFB, but adding to an already-open LFB waiting for an RFO that isn't the head is not ok. See @Bee's summary in comments.
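In terms of the question's own example, [+0, +0x40, +0x20] is exactly the ABA pattern, while [+0, +0x20, +0x40] is the AAB pattern that can keep draining into LFBs. A minimal C re-creation of the two patterns (my own sketch, not BeeOnRope's actual write_aabb / write_abab test; volatile keeps the compiler from reordering or merging the stores, mirroring the asm):

```c
#include <stdint.h>

static _Alignas(64) uint64_t buf[16];   /* two 64-byte lines: A = buf[0..7], B = buf[8..15] */

/* ABA: same offsets as the question ([rdi], [rdi+0x40], [rdi+0x20]).
 * The final line-A store can't drain into the already-open LFB for A once
 * line B's LFB has been opened after it, so it waits in the store buffer. */
void store_aba(volatile uint64_t *p, uint64_t v)
{
    p[0] = v;       /* line A          */
    p[8] = v;       /* line B (+0x40)  */
    p[4] = v;       /* line A (+0x20)  */
}

/* AAB: both line-A stores are adjacent in program order, so they can drain
 * into one LFB while its RFO is still pending, then line B's store follows. */
void store_aab(volatile uint64_t *p, uint64_t v)
{
    p[0] = v;       /* line A          */
    p[4] = v;       /* line A (+0x20)  */
    p[8] = v;       /* line B (+0x40)  */
}

int main(void)
{
    /* Toy driver only: a real test would time each pattern separately
     * (e.g. under perf) over a region large enough that the stores miss. */
    for (uint64_t i = 0; i < 10 * 1000 * 1000; i++) {
        store_aba(buf, i);
        store_aab(buf, i);
    }
    return 0;
}
```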

All of this is only tested for Intel CPUs, AFAIK.


Previous to this, we thought there was no store coalescing on Intel/AMD, but have long been puzzled by hints in Intel manuals about LFBs acting as WC buffers for stores to normal (strongly ordered) WB memory

(This section not updated in light of @BeeOnRope's new discovery).

There's also no solid evidence of any kind of store merging / coalescing in the store buffer on modern Intel or AMD CPUs, or of using a WC buffer (LFB on Intel) to hold store data while waiting for a cache line to arrive. See discussion in comments under Are two store buffer entries needed for split line/page stores on recent Intel?. We can't rule out some minor form of it near the commit end of the store buffer.

We know that some weakly-ordered RISC microarchitectures definitely do merge stores before they commit, especially to create a full 4-byte or 8-byte write of a cache ECC granule to avoid an RMW cycle. But Intel CPUs don't have any penalty for narrow or unaligned stores within a cache line.
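To make the ECC point concrete, here's a toy check (my own illustration, not any particular RISC design, and the 8-byte granule is an assumption): a store narrower than the granule forces a read-modify-write of the old data to recompute ECC, unless neighbouring pending stores can be combined into one full-granule write.

```c
#include <stdbool.h>
#include <stdint.h>

#define GRANULE 8                       /* assumed ECC granule size in bytes */

struct pending_store { uint64_t addr; unsigned len; };

/* True if the two pending stores are contiguous and together cover one whole
 * aligned granule: ECC can then be computed from the new data alone, so the
 * commit doesn't need to read-modify-write the old granule contents.         */
bool merges_to_full_granule(struct pending_store a, struct pending_store b)
{
    struct pending_store lo = (a.addr <= b.addr) ? a : b;
    struct pending_store hi = (a.addr <= b.addr) ? b : a;
    if (lo.addr + lo.len != hi.addr)        /* must be contiguous */
        return false;
    return (lo.addr % GRANULE) == 0 && lo.len + hi.len == GRANULE;
}
```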

For a while @BeeOnRope and I thought there was some evidence of store coalescing, but we've changed our minds. Size of store buffers on Intel hardware? What exactly is a store buffer? has some more details (and links to older discussions).

(Update: and now there is finally evidence of store coalescing, and an explanation of a mechanism that makes sense.)


Footnote 1: An RFO costs shared bandwidth and steals the line from other cores, slowing them down. And you might lose the line again before you get to actually commit into it if you RFO too early. LFBs are also needed for loads, which you don't want to starve (because execution stalls when waiting for load results). Loads are fundamentally different from stores, and generally prioritized.

So waiting at least for the store to graduate is a good plan, and maybe only initiating RFOs for the last few store-buffer entries before the head. (You need to check if L1d already owns the line before starting an RFO, and that takes a cache read port for at least the tags, although not data. I might guess that the store buffer checks 1 entry at a time and marks an entry as likely not needing an RFO.) Also note that 1 SB entry could be a misaligned cache-split store and touch 2 cache lines, requiring up to 2 RFOs...
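Continuing the toy store-buffer sketch from earlier (again my own names; the lookahead depth is an arbitrary guess), that policy might look like scanning a few graduated entries nearest the head and kicking off RFOs only for lines L1d doesn't already own:

```c
/* Continues the toy model above (same file).  Scan a few graduated entries
 * behind the head and start RFOs for lines L1d doesn't own yet, so those
 * misses overlap with the head's commit.  Checking one entry per cycle, as
 * guessed above, would need only a single tag-read port.
 * (A cache-split store would need checks for both of its lines; not modelled.) */
#define RFO_LOOKAHEAD 4                       /* arbitrary */

void start_rfo(uint64_t addr) { (void)addr; } /* stand-in for issuing the request */

void prefetch_rfos(struct store_buffer *sb)
{
    size_t i = sb->head;
    for (int n = 0; n < RFO_LOOKAHEAD && i != sb->tail; n++, i = (i + 1) % SB_SIZE) {
        struct sb_entry *s = &sb->e[i];
        if (!s->graduated)
            break;                            /* don't RFO speculatively */
        if (!l1d_owns_line(s->addr))          /* tag check: already E or M? */
            start_rfo(s->addr);
    }
}
```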

Footnote 2: Store buffer entries are allocated in program order (at the tail of the buffer), as instructions / uops are issued into the out-of-order back end and have back-end resources allocated for them. (e.g. a physical register for uops that write a register, a branch-order-buffer entry for conditional branch uops that might mispredict.) See also Size of store buffers on Intel hardware? What exactly is a store buffer?. In-order alloc and commit guarantee program-order visibility of stores. The store buffer insulates globally-visible commit from out-of-order speculative execution of store-address and store-data uops (which write store-buffer entries), and decouples execution in general from waiting for cache-miss stores, until the store buffer fills up.
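In terms of the same toy model, allocation in program order at issue time just means claiming the next tail slot before the store's address or data are even known (a sketch; the real structures and mechanisms are Intel's, not mine):

```c
/* Continues the toy model above.  A slot is claimed at the tail, in program
 * order, when the store uop issues into the back end; the store-address and
 * store-data uops fill in addr/data later, in whatever order they execute,
 * and 'graduated' is set when the uop retires from the ROB.  Reading the
 * buffer from head to tail therefore always gives program order.            */
size_t alloc_at_issue(struct store_buffer *sb)
{
    size_t idx = sb->tail;
    sb->tail = (sb->tail + 1) % SB_SIZE;       /* stall the front-end if full (not shown) */
    sb->e[idx].graduated = false;
    return idx;                                /* execution writes e[idx].addr / .data later */
}
```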

PS Intel calls the store buffer + load buffers collectively the memory order buffer (MOB), because they need to know about each other to track speculative early loads. This isn't relevant to your question, only for the case of speculative early loads and detecting memory-order mis-speculation and nuking the pipeline.

For retired store instructions (more specifically their "graduated" store buffer entries), it is just the store buffer that has to commit to L1d in program order.

Peter Cordes
  • Very interesting. I used to think that ROB was the only one who tracked that uops are retired in order and that's why the original question arose. Also I did not think of LB + SB = MOB. – Some Name Jun 14 '20 at 21:23
  • I did a quick search for MOB in the architecture manual and found that the MOB **Ensures loads and stores follow memory ordering rules of the Intel 64 and IA-32 architectures.** This means it's the MOB's responsibility to follow x86 memory ordering – Some Name Jun 14 '20 at 21:24
  • @SomeName: yes, exactly. It's up to the MOB to detect memory-order mis-speculation and trigger a pipeline nuke. But note that the answer to your question doesn't involve ordering stores relative to loads; waiting until post-retirement to commit stores for correctness gives us LoadStore ordering for free (assuming loads have to actually complete to retire, not just be checked for non-faulting). So the combined load+store buffer MOB aspect is irrelevant for this specific question, just in-order commit for store ordering from the SB itself. – Peter Cordes Jun 14 '20 at 21:35
  • I have changed my mind on this again. I believe stores that miss go into the LFB while the RFO is in progress _under certain conditions_. In particular, the conditions are that ordering is not violated. Ordering will be violated if a store would drain into an LFB that was already allocated for an earlier non-contiguous store miss, so in this case there is a stall. E.g., if A, B, C represent stores to different cache lines A, B, C, a series of stores like AAABBCCCC can drain into three LFBs, for lines A, B, C. – BeeOnRope Jun 16 '20 at 14:10
  • The CPU just has to make sure to commit the LFBs in order, A, B, C. However, in the sequence, AAABBCCCCA, (or more simply ABA) the final store can't go into the open LFB, it would lose the store-store ordering property. The ABA case is exactly the same as the OP's `[+ 0, + 0x40, + 0x20]` example. So it stalls: probably the store waits in the store buffer. Performance tests are consistent with this theory, but don't prove it. – BeeOnRope Jun 16 '20 at 14:13
  • I wrote recently about my new view [on RWT](https://www.realworldtech.com/forum/?threadid=173441&curpostid=192262), and use the same 0, 40, 20 test as the OP. @SomeName perhaps this question was motivated from that post? You can find the test in the [wip branch](https://github.com/travisdowns/bimodal-performance/commits/wip) of the bimodal performance test, they are called `write_aabb` and `write_abab`. – BeeOnRope Jun 16 '20 at 14:16
  • @BeeOnRope The actual origin of the question was when I was researching how non-temporal stores interact with the wc buffer and remembered that regular WB stores also use WC buffer on write miss. – Some Name Jun 17 '20 at 09:30
  • @BeeOnRope Searching through the Intel manual I found that the problem is described as the store not going into WC buffers. **When a write to a write-combining buffer for a previously-unwritten cache line occurs, there will be a read-for-ownership (RFO). If a subsequent write happens to another write-combining buffer, a separate RFO may be caused for that cache line. Subsequent writes to the first cache line and write-combining buffer will be delayed until the second RFO has been serviced to guarantee properly ordered visibility of the writes.** – Some Name Jun 17 '20 at 09:31
  • @BeeOnRope _So it stalls: probably the store waits in the store buffer._ According to Peter's answer, this is what my current picture of the world is. The Intel manual was not clear about the actual mechanics behind "WB store does not go into WC when RFO in progress" though. – Some Name Jun 17 '20 at 09:32
  • @SomeName - yes, this text in the ORM is part of the evidence in favor of what I describe here in the comments, in fact it is basically describing the ABA problem. I didn't understand your last comment. What is your current view? This answer as currently written is different than my view. This answer says "we think pending writes don't drain out of the store buffer into the WC/LFB while waiting for RFO", while my current view is "they do, except when it violates ordering". I think the ORM paragraph supports the latter view. – BeeOnRope Jun 17 '20 at 15:08
  • @BeeOnRope: Finally a store-coalescing mechanism that makes sense. It would imply loads snooping LFBs, but we already thought that might be a thing. Nice job cooking up an experiment to test it. Rewrote large parts of my answer. – Peter Cordes Jun 17 '20 at 19:40
  • @SomeName: made a major update to my answer in light of Bee's research. – Peter Cordes Jun 17 '20 at 19:41
  • "It would imply loads snooping LFBs" - maybe I didn't get it, but this would happen totally naturally? That is, conceptually (ignoring optimizations) loads first snoop the store buffer, if that misses, they check the L1D, if that misses, they allocate an LFB or use an existing LFB for that line. I think this would fall squarely into the "or use an existing LFB case". Perhaps the order of the LFB/L1D checking could be reversed? It would be interesting to see if there is a detectable latency change in this case. – BeeOnRope Jun 17 '20 at 21:07
  • "Nice job cooking up an experiment to test it" .... well actually I feel I haven't tested it directly. There is the ABAB vs AABB test, but I guess that could have other explanations. I'm planning a more direct test which checks it without triggering the ABA thing, e.g., checking whether a long stream of misses to the same line appear to drain, but I haven't written it yet. – BeeOnRope Jun 17 '20 at 21:09
  • @BeeOnRope: Yes, it would be perfectly plausible to have a design where loads didn't need to snoop the LFBs. I was thinking some of our previous discussions might have concluded that they'd have to or probably did, though. And maybe I was mixing up the case of `movntdqa` loads on WC memory which read only from an LFB, not from L1d. But that could take a totally different path so we shouldn't assume anything about WB loads and store-forwarding based on that. Anyway, forget I said we already thought that. If this explanation for your test results holds up, it pretty much proves loads snoop LFB – Peter Cordes Jun 17 '20 at 21:11
  • @PeterCordes - maybe we are disagreeing on the meaning of "snoop", but I don't understand how loads could not snoop the LFBs. Regardless of how stores are implemented, a load that misses in L1D has to check the LFB for a "hit" (a hit doesn't mean the data is there, the LFB could just be in progress), since it needs to allocate a new LFB if there isn't one already there. – BeeOnRope Jun 17 '20 at 21:18
  • Maybe by "snoop the LFBs" you mean specifically the scenario where the value is actually retrieved from a "partially filled" LFB which has some bytes filled in (by stores) but is waiting for the rest of the data to arrive? – BeeOnRope Jun 17 '20 at 21:20
  • @BeeOnRope: I meant actually being able to store-forward data from an LFB, yes. Not just trigger a flush of NT stores in a partial line, or tack itself on to a pending load-miss LFB after missing in L1d. (BTW, reloading an NT store flushed to memory before reloading, right? That probably means NT stores can ignore the ABA ordering restriction when leaving the store buffer for an LFB.) Anyway, partial NT stores are one reason that loads would need to at least probe LFB addresses, regardless of how normal stores use them. – Peter Cordes Jun 17 '20 at 21:27
  • @PeterCordes - yes, I see what you mean. It's kind of interesting because for store-forwarding I would normally think of as happening before the L1D check, but here it _could_ happen after the L1D check (since that will always miss in this case), yet is still logically a store forwarding case (including wrt to memory ordering since these stores aren't GO yet). – BeeOnRope Jun 17 '20 at 21:44
  • @BeeOnRope: I'd assume that in silicon, all 3 paths are tried in parallel (store buffer, LFBs, and L1d), and merged based on which ones come up with results. e.g. store-forward from LFB or SB would take precedence over an L1d hit. Checking LFBs and SB in the same cycle would preclude finding the same line in both, or worse missing it, if it was moving from SB to LFB as you check. – Peter Cordes Jun 17 '20 at 21:49
  • I think there might be other cases where values come from an LFB, e.g., when a line is being evicted from the L1D, if these use the LFBs (not sure about it), then I guess there are scenarios where the value might come from the LFB, although in this case it is already GO. Yes, I think NT stores bypass all these rules: patents describe how LFBs are in WC or non-WC mode, depending, and the behavior is quite different. – BeeOnRope Jun 17 '20 at 22:11
  • @PeterCordes about stores you said "(A "graduated" store is one whose uops have retired from the ROB, i.e. become non-speculative). But if you don't have that requirement, you could RFO even earlier, even speculatively, but you probably wouldn't want to." do non-retired loads also prefetch RFO (because there is no memory order concern there)? Is it aggressive enough to cause adverse coherency traffic if you have a lot of loads hidden behind conditions? – Noah Jan 05 '21 at 19:46
  • @Noah: Loads are fundamentally different from stores: they have to access cache/memory to provide data for later instructions that use the load result. So yes, as soon as they execute, they either take data from L1d cache or set up a request for that cache line on a miss. (They also don't need exclusive ownership, so they wouldn't send an RFO ("for ownership"), just a plain read. If no other core had the line, then the L3 tags or whatever snoop filtering mechanism notices that and gives exclusive ownership anyway, otherwise sends a share request to the core that currently has it exclusive) – Peter Cordes Jan 06 '21 at 01:45
  • @PeterCordes could out of order execution cause the store buffer coalescing to fail? i.e ```write (A)```, ```write (A + reg)```, ```write B``` assuming ```A``` and ```A + reg``` are in the same cache line but ```reg``` is not ready yet after ```write (A)``` the store buffer would be ABA. Would this still fill the LFB as AAB because stored instructions are retired to the LFB in program order or could the LFB end up as ABA and cause a stall? – Noah Jan 13 '21 at 01:06
  • @Noah: No. Store-buffer entries are allocated during issue/rename/allocate, like physical regs for uops with reg outputs, and other execution resources uops will need. Thus, the store-buffer can always be read in program order, regardless of how it was filled. Remember that commit to L1d is only even considering entries whose store uops have retired from the ROB and are thus non-speculative, and this process restores program order. This means commit doesn't have to know anything about execution order to preserve the program-order commit required by the x86 TSO memory model. – Peter Cordes Jan 13 '21 at 05:20
  • @PeterCordes is the ```AAABBBCCC``` optimization only available when the store misses L1 cache? I.e will the stores only coalesce in the LFB or do they coalesce in the store buffer first? (looking at steps 4 & 5 in your post [here](https://stackoverflow.com/questions/61129773/how-do-the-store-buffer-and-line-fill-buffer-interact-with-each-other)) – Noah Jan 19 '21 at 07:41
  • @Noah: On an L1d store hit (i.e. already in M or E state), I'm pretty sure no LFB is involved at all. The store will just commit to L1d without checking previous store-buffer entries. IIRC, there's no evidence of the store buffer itself merging consecutive stores to the same line before they reach the end, only possible merging into LFBs waiting for an RFO. And BTW, that's Bee's answer; I only edited it. – Peter Cordes Jan 19 '21 at 16:21
  • @Noah: Cache hit stores can be committed 1/cycle, sufficient to keep up with store-data throughput for sustained / steady state. (Or in Ice Lake, 2/clock, except for 64-byte stores which can only commit 1/clock, so there might be something new there. But not just merging of stores within one line; I think that 2/clock commit is supposed to be for any two stores of 32B or less each). Committing a merged group could free multiple SB entries and make room for the front-end to issue more stores a few cycles sooner, maybe getting to independent work sooner. But that's probably minor vs. misses. – Peter Cordes Jan 19 '21 at 16:29
  • @PeterCordes regarding my earlier question about coalescing in the SB. Running some tests with 4x `ymm` stores in a small region that fits in L1 (1024 bytes). Dest is cache aligned. With ascending/contiguous access pattern `vmovdqu ymm, r; vmovdqu ymm, 32(r); vmovdqu ymm, 64(r); vmovdqu ymm, 96(r)` I see low store buffer stalls. If I swap the offset 32/0 or offsets 64/96 also low SB stalls. If I swap 32/96, order of magnitude spike in SB stalls. Makes me think even in L1 path there can be some coalescing with back to back stores to same line. Thoughts? (96,64,0,32 also low SB stalls) – Noah Jun 21 '21 at 19:40
  • In fact this happens even if I repeatedly store to the exact same 128 byte region. With any pattern where the stores go AABB I see low SB stalls, and any pattern ABAB has high SB stalls. – Noah Jun 21 '21 at 20:05
  • @Noah: Oh, interesting that you're seeing this for the L1 hit case. This is on Ice Lake, right? So 2/clock store execution, but apparently the advertized up-to-64B/clock commit to L1d (1x 64 or 2x32) is only when two 32-byte stores can merge and commit to one line. Probably the commit end of the store buffer looks at the first *2* entries to see if they're to the same line. That's very common for sequential traversal of an array so even that limited support is going to be useful in real life, without an extra write port in the cache. Intel's opt manual doesn't mention that caveat IIRC :/ – Peter Cordes Jun 21 '21 at 21:14
  • @PeterCordes further on SB coalescing... in a loop if you have `vmovdqa64 %zmm0, (%rdi), vmovdqa64 %ymm0, (%rdi)` the loop runs at 2c / iteration. Compared to `vmovdqa64 %ymm0, (%rdi), vmovdqa64 %ymm0, (%rdi)` or `vmovdqa64 %ymm0, (%rdi), vmovdqa64 %ymm0, 16(%rdi), vmovdqa64 %ymm0, 32(%rdi)` which run at 1c / iteration and 1.5c / iteration respectively. Note the two faster examples are both bound by store ports but the former is bound by something else. My intuition points to some logic where if two stores are **partial** stores to the same cache line they can be combined at the point of writing... – Noah Oct 12 '21 at 21:58
  • Without some combining (in the second case) I can't explain why it is store port bound whereas the first case appears to be bound by something else. – Noah Oct 12 '21 at 21:59
  • I think that the coalescing must be happening in the writeback from SB -> L1, however, as with stores that can coalesce I still don't see store-forwarding across writes, which I think we would expect if they were truly taking the same entry. – Noah Oct 12 '21 at 22:01
  • @Noah: IIRC, Intel's optimization manual says that (sustained) 2/clock store throughput is only possible on ICL with 32-byte or narrower stores. They *don't* say that it's only possible when the stores are to the same line, but it seems in practice that's the case, implying that the mechanism is merging, like probably the commit stage looking at the last 2 entries in the SB. – Peter Cordes Oct 12 '21 at 22:20
  • And yeah, I've been assuming that merging in Ice Lake (and earlier CPUs if done at all) is right at commit. Otherwise every SB entry would need logic to compare itself with adjacent entries, instead of just pick 2 and see if we can commit both. (Or I guess try to merge right at retire). But freeing up an SB entry earlier doesn't help (except for store-forwarding); you can't alloc it if you want a simple ring buffer for your SB to get memory ordering, not a power-intensive allocator like PRF and RS entries which also don't need to be ordered relative to each other. – Peter Cordes Oct 12 '21 at 22:23
  • @Noah: merging two 16-byte stores into one line but not contiguous with each other could even make store-forwarding worse, by effectively making a masked store to try to forward from. IDK how well forwarding would work if you did that with `vmovdqa64 %zmm0, (%rdi){%k1}`, but if it's not great, then at-commit merging is better for that, too. But really I think the key argument is that the simplest HW design would be to have the commit stage just pick the last 2 SB entries and, if neither is 64-byte, consider them for simultaneous commit. – Peter Cordes Oct 12 '21 at 22:28
  • @Noah: Note that this Ice Lake committing 2 SB entries per clock is different from using an LFB to merge cache-miss stores. That makes sense even if done 1/clock, just to free up SB entries when possible with a contiguous stream of stores. – Peter Cordes Oct 12 '21 at 22:29
  • @PeterCordes "Intel's optimization manual says that (sustained) 2/clock store throughput is only possible on ICL with 32-byte or narrower stores" Do you know where? Not seeing it. – Noah Oct 12 '21 at 23:02
  • @Noah: Table 2-4 on page 43, in Section 2.1.1.3 - Peak L1d bandwidth is *2×64B loads + **1x64B or 2x32B stores***, sustained = peak. That's *cache* bandwidth, not store-execution. It may not *explicitly* say anywhere (else) that 2/clock 64-byte stores can't be sustained (even without loads), but IIRC this is also a known fact from microbenchmarks like https://uops.info/ and Instlatx64. Hmm, strangely that Intel table lists L1d and 48kiB / 8-way. I was pretty sure they increased the size by bumping up the associativity to 12-way, not with other tricks to avoid aliasing problems. – Peter Cordes Oct 13 '21 at 03:08
  • @PeterCordes Thanks. But note the weird behavior here is that the 64-byte store and 32-byte store to the same line appear to be bottlenecked by something related to SB -> L1. On the other hand 3x 32-byte stores to the same cache line are bottlenecked by store ports, indicating there is some optimization happening in the SB. My intuition, given that overlapping writes can never be used for store-forwarding, is that the "SB coalescing" is related to merging the writeback to L1. This makes sense given Travis' findings of LFB coalescing. Possibly the SB has a shared mechanism to group writes to the same\ – Noah Oct 13 '21 at 18:41
  • cache line when writing out to the next step. Either LFB or L1. – Noah Oct 13 '21 at 18:42
  • @Noah: It's not strange if the commit-time coalesce hardware logic never tries to coalesce a 64B store with anything, just bailing out entirely if either of the two SB entries it looked at are 64B. Perhaps that would have required wider hardware that cost more power. It's an unusual case that won't come up in normal loops over arrays, so it makes sense to not bother to handle it. – Peter Cordes Oct 13 '21 at 20:38
  • @PeterCordes does loads coalesce in the Load Buffer the same way stores do? I.e AAABBBCCCA -> ABCA? – Noah Nov 28 '21 at 04:45
  • @Noah: Multiple cache-miss loads to the same line can each have their load buffer waiting for the same LFB, regardless of order. But sharing actual load buffers, no, I don't think that would make sense. Stuff inside the core has to happen when load data finally arrives, so the things associated with doing each individual load separately need to stay allocated. It wouldn't make sense for one load-buffer entry to be big enough to represent something like these 4 bytes go to phys reg 100, these to phys reg 101, etc. Probably dependent uops wait per load buffer. – Peter Cordes Nov 28 '21 at 04:51
  • @PeterCordes re: "Multiple cache-miss loads to the same line can each have their load buffer waiting for the same LFB," regardless of program order, or only in AAABBBAAA -> LFB for ABA? Can't see how AB and x86 guaranteeing consistent load order can mesh. Or take the case of loadA, loadA, storeA, loadA. All 3 As can coalesce? Wouldn't that violate local load/store ordering? – Noah Nov 28 '21 at 04:54
  • @Noah: Intel CPUs since P6 do *speculative* early loads. That's part of why `machine_clears.memory_order` is a thing. I think they check it by having the load buffer(?) / memory order buffer verify the cache line hasn't been invalidated since the actual load, by the time it's architecturally allowed to happen. – Peter Cordes Nov 28 '21 at 05:02