Are two store buffer entries needed for split line/page stores on recent Intel?

Question

It is generally understood that one store buffer entry is allocated per store, and this store buffer entry holds the store data and physical address¹.

In the case that a store crosses a 4096-byte page boundary, two different translations may be needed, one for each page, and hence two different physical addresses may need to be stored. Does this mean that page-crossing stores take 2 store buffer entries? If so, does it apply also to line-crossing stores?

¹ ... and perhaps some/all of the virtual address to help in store forwarding.

Does the SB hold physical addresses? I've always thought it held virtual addresses and that the physical address was needed only when checking for L1 hit/miss (in the latter case, it would be used to fetch data from the upper layers of the mem hierarchy). — Margaret Bloom, Apr 13 '20 at 10:31
@MargaretBloom - that is certainly possible! I don't have any specific proof, but my impression was that the store address uop does the address translation and writes the result into the store buffer, so the translation doesn't have to occur later when the commit gets to the L1. I think one of the spectre papers (about store forwarding) mentioned that the physical address is invovled and I've read it elsewhere too. I can think of various problems with having the lookup at L1 commit but nothing definitive. I've asked [another question](https://stackoverflow.com/q/61190976/149138) on it. — BeeOnRope, Apr 13 '20 at 15:16
Yes, that makes sense. Now that I remember, the store forwarding check is made of two checks: one fast check on VA and one slower check on PA, IIRC. It is possible/reasonable that a store crossing 4Ki is split into two SB entries. — Margaret Bloom, Apr 13 '20 at 16:14
@Margaret - yeah. About the split, the part that complicates the 2 entry theory is that SB entries are allocated at rename, long before you know anything about the address including if it splits. It seems hard to retroactively allocate another SB entry when you discover the split scenario. Page-split stores used to have terrible performance and maybe that's part of the reason why: e.g. it was a full pipeline stall. — BeeOnRope, Apr 13 '20 at 16:59
Henry Wong (the author of the stuffedcow blog), in his thesis, handle split page *loads* by executing them in two times (and using a single Load Queue entry) and using split access registers to combine the data. For *stores* it probably does the same (but no combining is required, of course): a single SB entry but executing the store in two times. I would not be surprised if this is what actual CPUs do. A patent shows a flow chart containing a step where both the VA and the PA are stored in the SB. But the VA is stored first (by the STD uop), infact ... — Margaret Bloom, Apr 13 '20 at 17:43
store forwarding (one of the sources for loads) is first done by checking only the low 12 bits (the reason behind 4K aliasing) and then confirmed when the TLB output the PA (which also confirms/refutes cache hits). So in the end I believe a SB entry contains both VA and PA but the latter is filled later (after the TLB lookup, the SB lookup for loads, the L1 lookup and the combining registers lookup, all done in parallel) and in case of split access, the uop is dispatched two times. A set of flags would track which halve has to be executed. — Margaret Bloom, Apr 13 '20 at 17:43
[Here's the picture from the patent](https://patentimages.storage.googleapis.com/74/6a/78/c296c09c49efec/US6378062-drawings-page-9.png) and [here](https://patents.google.com/patent/US20080082765A1/en?oq=US+2008%2f0082765+A1) a patent on store forwarding checks (slightly related to this question). — Margaret Bloom, Apr 13 '20 at 17:55
@MargaretBloom: Loads are fundamentally different from stores: hits need to read data from L1d cache directly, not just place entries in a buffer. For a store, accessing two separate lines in L1d only has to happen during commit, not execute. If you don't cross a 4k boundary, a cache-line boundary might not be significant for executing the store-address or store-data uops themselves. **Store-forwarding can work for cache-line split stores, even if the reload is also split across 2 lines.** (on Lynnfield and later) https://blog.stuffedcow.net/2014/01/x86-memory-disambiguation/ — Peter Cordes, Apr 14 '20 at 04:51
@MargaretBloom: replay / dispatch I think only happens for uops *dependent* on a load with unexpectedly high latency (cache miss or split). Unfortunately Bee's and my earlier testing of split loads was using a chain of dependent load uops without any ALU uop in between, and we mis-attributed the cause of the replays to handling the split loads, not eager dispatch because of depending on an earlier split load. — Peter Cordes, Apr 14 '20 at 04:57
@PeterCordes I've used the wrong terminology. My point is that when the backend is sure a store is not speculative and ready to be drained from the SB, if it crosses a page boundary, it is "presented" to the memory subsystem twice, with adjusted addresses. — Margaret Bloom, Apr 14 '20 at 10:35
@MargaretBloom: Ok, then yes, your expectation matches mine. Commit from SB into L1d has to modify 2 different cache lines, so we expect *that* takes separate operations. But that doesn't tell us whether the execution unit can create the necessary SB entry with only one trip each through port 4 and port 2/3/7. — Peter Cordes, Apr 14 '20 at 16:03
If the SB does any merging of adjacent stores, that could sometimes reduce throughput requirements in the commit side vs. the exec side (not for purely split stores, but a mix of split and mergeable stores). Or if not, then at least the SB can fill more quickly and open space in the RS, giving a wider OoO exec window for bursty workloads. — Peter Cordes, Apr 14 '20 at 16:04
@PeterCordes - I am pretty sure by now that the SB doesn't merge adjacent stores. It's just the wrong place to do it, I think: if SB is allocated at rename and deallocated at (in-order) commit having some kind of merge in what is otherwise a FIFO queue seems tough. Also, robsize test 33 seems to show no merging: it writes to the same location every time, yet the store buffer size still appears exactly as documented, so there is no merging in this very simple case. — BeeOnRope, Apr 14 '20 at 19:07
If there is merging, I believe it happens at the head of the store buffer. E.g., the stores in the store buffer commit not directly into L1, but into one or more cache-line sized staging buffers, which are committed at some point to L1. This means that stores to the same line get absorbed by this buffer and reduce the L1 traffic. I am not sure even this exists, and whether they are shared with LFB, etc. — BeeOnRope, Apr 14 '20 at 19:09
That would make sense, although it wouldn't increase store commit throughput except maybe if this can happen while waiting for exclusive ownership of a line. Devil's advocate: I was hoping that maybe adjacent entries could combine into one, and leave one of them empty. FIFO allocation means it wouldn't get reused until the merged stores commit, so no gain in effective SB size, but if L1d write ports were the limit it could help with that. e.g. if one split store uop can create an SB entry that takes 2 cycles to commit, merging could balance that out. — Peter Cordes, Apr 14 '20 at 19:15
But merging is only easy if SB entries hold data for exactly one cache line, *not* a split. Since we think that SB entries are 32 or 64B wide unaligned, merging in the SB itself is not actually simple regardless of allocation! So yes, commit into some kind of merge buffer for a single cache line is plausible at the end of the SB, but not so much before that. — Peter Cordes, Apr 14 '20 at 19:17
@PeterCordes - well it can increase the throughput if more than 1 consecutive SB entry can be merged into the buffer in once cycle. It also reduces pressure on the write port. For example, maybe the write port cannot be used every cycle depending on what the read ports are doing, or if a line arrives from L2 and this could prevent conflicts in those cases. I don't know if any of this exists through, but the zombie load paper (IIRC) claimed something like this (since they could read stored data from what they said was the LFB). — BeeOnRope, Apr 15 '20 at 00:03
I asked a [related question here](https://stackoverflow.com/q/53435632/149138) but I wasn't clear if I was talking about the cache miss or cache hit case. Hadi answered it for the miss case, but I'm not actually sure if that is even meaningful: what is the difference between an LFB and WC buffer, after all? I guess what it shows is that the LFB/WC buffer does absorb stores in the miss case, without stalling the SB. I suppose a test for the hit case is needed now... — BeeOnRope, Apr 15 '20 at 00:23
BTW, there was a question recently about performance dropping with > 4 store streams. Do you know how I can find it? — BeeOnRope, Apr 15 '20 at 00:24
Agreed, that all makes sense. If multiple SB entries could be collected per cycle into a merge buffer, that would be measurable with split stores mixed with aligned dword stores like I suggested earlier, if the store execution units can create 1 SB entry per clock even for split stores. If not, then yes good point about hiding cycles when the write port is used for an incoming line from another cache. Harder to test but still something the designers might have considered worth doing. — Peter Cordes, Apr 15 '20 at 00:30
I think I found it: google for `site:stackoverflow.com store 4 streams performance` => https://stackoverflow.com/questions/linked/47851120?sort=newest linked from your old Bi-modal question => [For-loop efficiency: merging loops](https://stackoverflow.com/q/51021262) — Peter Cordes, Apr 15 '20 at 00:31

Are two store buffer entries needed for split line/page stores on recent Intel?

0 Answers0

Linked