
The Intel optimization manual talks about the number of store buffers that exist in many parts of the processor, but does not seem to say anything about the size of the store buffers. Is this public information, or is the size of a store buffer kept as a microarchitectural detail?

The processors I am looking into are primarily Broadwell and Skylake, but information about others would be nice as well.

Also, what do store buffers do, exactly?

asked by Curious, edited by Peter Cordes
  • @RobertHarvey Didn't quite understand why you put this question on hold. Is there something that was unclear? What in particular was too broad? – Curious Feb 25 '19 at 23:25
  • You mean other than the fact that you linked to a 700 page book with a dozen references to store buffers in it? – Robert Harvey Feb 25 '19 at 23:26
  • @RobertHarvey I mentioned that none of them talk about the size of store buffers. Which is what I want to ask about - is that public information? If not, what do people do when they want to estimate this? What do you recommend here? – Curious Feb 25 '19 at 23:27
  • To solve the problem with your question, reframe it so that it focuses on a specific, software-development problem you are having. Code examples are always helpful. Otherwise, a good answer would require several paragraphs, and perhaps the better part of a book chapter. See also https://stackoverflow.com/questions/11105827/what-is-a-store-buffer – Robert Harvey Feb 25 '19 at 23:29
  • @RobertHarvey I think OP's question is perfectly on topic. The only refinement he might need to make is nailing it down to a specific microarchitecture and a specific store buffer. Not all questions need to be about a software-development problem, and I think you were too fast in closing this one. – fuz Feb 25 '19 at 23:30
  • See also [this](https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)), which says "Larger store buffer (56 entries, up from 42)." – Robert Harvey Feb 25 '19 at 23:31
  • @RobertHarvey - but what is the size of an entry? – Curious Feb 25 '19 at 23:32
  • And a bit farther down... "The store buffer has been increased by 42 entries from Broadwell to 56 for a total of 128 simultaneous memory operations in-flight or roughly 60% of all µOPs." – Robert Harvey Feb 25 '19 at 23:32
  • I think it's safe to say that each entry is somewhere between 64 bits and one complete processor instruction, give or take. – Robert Harvey Feb 25 '19 at 23:33
  • @RobertHarvey I don’t understand what got you to that conclusion. Can we reopen the question so people are incentivized to provide an answer with the relevant details? – Curious Feb 26 '19 at 00:10
  • @fuz What did you mean by a particular store buffer? My understanding is that each CPU has its own store buffer and that's it? – Curious Feb 26 '19 at 00:29
  • This is not documented but a store buffer entry contains at least the store data, the store physical address, the store linear address, a store type field (basically the store opcode), a blocking code field, and other fields. For example, on a microarchitecture with AVX2 (but not AVX512) an entry size is at least 32 bytes (for data) + 39 bits (physical address) + 48 bits (linear address) + other smaller fields. We don't know exactly. – Hadi Brais Feb 26 '19 at 00:58
  • @HadiBrais So are you basically saying that this is an implementation detail within Intel processors and is not public information? – Curious Feb 26 '19 at 06:00
  • @Curious I am not super familiar with store buffers. There is no reason why a CPU should have only one of them and there might be multiple buffers for different tiers of storage. Mostly I wanted you to clarify what exactly you are referring to. – fuz Feb 26 '19 at 09:15
  • @fuz: See footnote 1 in my answer: write buffers elsewhere on chip wouldn't be called a "store buffer", I think that term is specific to buffering stores between execution and commit to L1d. Each (logical) core has 1 store buffer. I was thinking there was a question somewhere in here about what store buffers were, not just how big they were; maybe this question should be retitled to better fit the canonical answer I wrote about what store buffers are and what they do. – Peter Cordes Feb 26 '19 at 19:49

1 Answer


Related: What is a store buffer? and a beginner-friendly (but detailed) intro to the concept of buffers in Can a speculatively executed CPU branch contain opcodes that access RAM?, which I highly recommend reading for CPU-architecture background on why we need them and what they do (decoupling execution from commit to L1d / cache misses, and allowing speculative exec of stores without making speculation visible in coherent cache).

Also How do the store buffer and Line Fill Buffer interact with each other? has a good description of the steps in executing a store instruction and how it eventually commits to L1d cache.


The store buffer as a whole is composed of multiple entries.

Each core has its own store buffer1 to decouple execution and retirement from commit into L1d cache. Even an in-order CPU benefits from a store buffer to avoid stalling on cache-miss stores, because unlike loads they just have to become visible eventually. (No practical CPUs use a sequential-consistency memory model, so at least StoreLoad reordering is allowed, even in x86 and SPARC-TSO).
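
As a concrete illustration, here's the classic StoreLoad litmus test as a minimal C++ sketch (my own, not from any vendor docs; a real test harness would loop this many times with the threads synchronized, since a single run rarely catches the reordering). Both loads can return 0 in the same run, because each thread's store can still be sitting in its own core's store buffer when the other thread's load reads cache:

```cpp
// Minimal StoreLoad litmus-test sketch (names are illustrative).
// On x86 these relaxed ops compile to plain mov, and r1 == 0 && r2 == 0
// is still a possible outcome: each store can sit in its core's store
// buffer while the other core's load reads the old value from L1d.
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

int main() {
    std::thread t1([] {
        x.store(1, std::memory_order_relaxed);
        r1 = y.load(std::memory_order_relaxed);  // may read before x's store commits
    });
    std::thread t2([] {
        y.store(1, std::memory_order_relaxed);
        r2 = x.load(std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    std::printf("r1=%d r2=%d\n", r1, r2);  // "r1=0 r2=0" is allowed
}
```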

For speculative / out-of-order CPUs, it also makes it possible to roll back a store after detecting an exception or other mis-speculation in an older instruction, without speculative stores ever being globally visible. This is obviously essential for correctness! (You can't roll back other cores, so you can't let them see your store data until it's known to be non-speculative.)


When both logical cores are active (hyperthreading), Intel partitions the store buffer in two; each logical core gets half. Loads from one logical core only snoop its own half of the store buffer2. See What will be used for data exchange between threads are executing on one Core with HT?.

The store buffer commits data from retired store instructions into L1d as fast as it can, in program order (to respect x86's strongly-ordered memory model3). Requiring stores to commit as they retire would unnecessarily stall retirement for cache-miss stores. Retired stores still in the store buffer are definitely going to happen and can't be rolled back, so they can actually hurt interrupt latency. (Interrupts aren't technically required to be serializing, but any stores done by an IRQ handler can't become visible until after existing pending stores are drained. And iret is serializing, so even in the best case the store buffer drains before returning.)

It's a common(?) misconception that the store buffer has to be explicitly flushed for data to become visible to other threads. Memory barriers don't cause the store buffer to be flushed; full barriers make the current core wait until the store buffer drains itself, before allowing any later loads to happen (i.e. read L1d). Atomic RMW operations have to wait for the store buffer to drain before they can lock a cache line and do both their load and store to that line without allowing it to leave MESI Modified state, thus stopping any other agent in the system from observing it during the atomic operation.
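
To make that concrete, a hedged C++ sketch (function names are mine): a full barrier doesn't push stores out any faster, it just stalls this core's later loads until its own store buffer has drained.

```cpp
#include <atomic>

std::atomic<int> x{0}, y{0};

int store_then_load() {
    x.store(1, std::memory_order_relaxed);
    // Full barrier: doesn't "flush" anything early.  This core just waits
    // for its own store buffer to drain before the load below may read L1d.
    // On x86, compilers typically emit mfence (or a locked op) here.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return y.load(std::memory_order_relaxed);
}

int atomic_rmw() {
    // An atomic RMW has the same drain requirement built in: on x86 this
    // compiles to a lock-prefixed instruction (lock xadd), which also keeps
    // the cache line in Modified state between its load and store halves.
    return x.fetch_add(1, std::memory_order_seq_cst);
}
```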

To implement x86's strongly ordered memory model while still microarchitecturally allowing early / out-of-order loads (and later checking if the data is still valid when the load is architecturally allowed to happen), load buffer + store buffer entries collectively form the Memory Order Buffer (MOB). (If a cache line isn't still present when the load was allowed to happen, that's a memory-order mis-speculation.) This structure is presumably where mfence and locked instructions can put a barrier that blocks StoreLoad reordering without blocking out-of-order execution. (Although mfence on Skylake does block OoO exec of independent ALU instructions, as an implementation detail.)

movnt cache-bypassing stores (like movntps) also go through the store buffer, so they can be treated as speculative just like everything else in an OoO exec CPU. But they commit directly to an LFB (Line Fill Buffer), aka write-combining buffer, instead of to L1d cache.
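
For example (a minimal sketch with SSE intrinsics; the function name and the alignment assumption are mine): writing a whole 64-byte line with NT stores lets the LFB do a full-line write with no RFO.

```cpp
#include <immintrin.h>

// Sketch: cache-bypassing NT stores.  They still go through the store
// buffer (so they're still speculative like everything else), but commit
// into an LFB / write-combining buffer instead of L1d.  dst is assumed
// 64-byte aligned, so these four 16-byte stores cover one whole cache
// line, allowing a full-line write with no RFO.
void nt_fill_line(float* dst, float value) {
    __m128 v = _mm_set1_ps(value);
    _mm_stream_ps(dst +  0, v);   // bytes  0..15 of the line
    _mm_stream_ps(dst +  4, v);   // bytes 16..31
    _mm_stream_ps(dst +  8, v);   // bytes 32..47
    _mm_stream_ps(dst + 12, v);   // bytes 48..63
    _mm_sfence();  // order the NT stores before any later release store / flag
}
```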


Store instructions on Intel CPUs decode to store-address and store-data uops (micro-fused into one fused-domain uop). The store-address uop just writes the address (and probably the store width) into the store buffer, so later loads can set up store->load forwarding or detect that they don't overlap. The store-data uop writes the data.

Store-address and store-data can execute in either order, whichever is ready first: the allocate/rename stage that writes uops from the front-end into the ROB and RS in the back end also allocates a load or store buffer entry for load or store uops at issue time, or stalls until one is available. Since allocation and commit happen in-order, that probably means older/younger is easy to keep track of, because it can just be a circular buffer that doesn't have to worry about old long-lived entries still being in use after wrapping around. (Unless cache-bypassing / weakly-ordered NT stores can do that? They can commit to an LFB (Line Fill Buffer) out of order. Unlike normal stores, they commit directly to an LFB for transfer off-core, rather than to L1d.)
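
A hedged sketch of why the SB records the address and width (illustrative at the asm level only; a compiler will optimize this away unless you defeat it, e.g. with -O0 or volatile): an exact-match reload can be forwarded straight from the store buffer, while a wider load that only partially overlaps a narrower in-flight store typically takes a store-forwarding stall.

```cpp
#include <cstdint>
#include <cstring>

// Sketch of store->load forwarding behavior (not a benchmark):
uint64_t forwarding_demo(uint64_t v) {
    uint64_t slot = v;            // 8-byte store: address + data go into the SB
    uint64_t fast = slot;         // 8-byte exact-match reload: forwards from the SB

    uint32_t lo = (uint32_t)v;
    std::memcpy(&slot, &lo, 4);   // 4-byte store into the low half of slot
    uint64_t slow;
    std::memcpy(&slow, &slot, 8); // 8-byte reload overlapping the narrower store:
                                  // typically a store-forwarding stall; the load
                                  // has to wait for the store to commit to L1d
    return fast + slow;
}
```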


but what is the size of an entry?

Store buffer sizes are measured in entries, not bits.

Narrow stores don't "use less space" in the store buffer, they still use exactly 1 entry.

Skylake's store buffer has 56 entries (wikichip), up from 42 in Haswell/Broadwell, and 36 in SnB/IvB (David Kanter's HSW writeup on RealWorldTech has diagrams). You can find numbers for most earlier x86 uarches in Kanter's writeups on RWT, or Wikichip's diagrams, or various other sources.

SKL/BDW/HSW also have 72 load buffer entries, SnB/IvB have 64. This is the number of in-flight load instructions that either haven't executed or are waiting for data to arrive from outer caches.


The size in bits of each entry is an implementation detail that has zero impact on how you optimize software. Similarly, we don't know the size in bits of a uop (in the front-end, in the ROB, in the RS), or TLB implementation details, or many other things, but we do know how many ROB and RS entries there are, and how many TLB entries of different types there are in various uarches.

Intel doesn't publish circuit diagrams for their CPU designs and (AFAIK) these sizes aren't generally known, so we can't even satisfy our curiosity about design details / tradeoffs.


Write coalescing in the store buffer:

Back-to-back narrow stores to the same cache line can (probably?) be combined aka coalesced in the store buffer before they commit, so it might only take one cycle on a write port of L1d cache to commit multiple stores.

We know for sure that some non-x86 CPUs do this, and we have some evidence / reason to suspect that Intel CPUs might do this. But if it happens, it's limited. @BeeOnRope and I currently think Intel CPUs probably don't do any significant merging. And if they do, the most plausible case is that entries at the end of the store buffer (ready to commit to L1d) that all go to the same cache line might merge into one buffer, optimizing commit if we're waiting for an RFO for that cache line. See discussion in comments on Are two store buffer entries needed for split line/page stores on recent Intel?. I proposed some possible experiments but haven't done them.
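
One such experiment could look roughly like this (a hypothetical sketch; the timing harness is omitted, and as written both loops hit in L1d, so the interesting version would make the stores miss, e.g. with a much larger footprint or clflush, so commit has to wait for an RFO):

```cpp
#include <cstdint>

alignas(64) static char buf[8 * 64];

// If the SB coalesces adjacent narrow stores to the same line before
// commit, the same-line loop could commit in fewer L1d write-port cycles
// than the loop that spreads the same number of stores over 8 lines.
void stores_same_line(long iters) {
    for (long i = 0; i < iters; i++)
        for (int b = 0; b < 8; b++)
            *(volatile char*)&buf[b] = (char)i;       // 8 stores, 1 line
}

void stores_spread_lines(long iters) {
    for (long i = 0; i < iters; i++)
        for (int b = 0; b < 8; b++)
            *(volatile char*)&buf[b * 64] = (char)i;  // 8 stores, 8 lines
}
```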

Earlier stuff about possible store-buffer merging:

See discussion starting with this comment: Are write-combining buffers used for normal writes to WB memory regions on Intel?

And also Unexpectedly poor and weirdly bimodal performance for store loop on Intel Skylake may be relevant.

We know for sure that some weakly-ordered ISAs like Alpha 21264 did store coalescing in their store buffer, because the manual documents it, along with its limitations on what it can commit and/or read to/from L1d per cycle. Also PowerPC RS64-II and RS64-III, with less detail, in docs linked from a comment here: Are there any modern CPUs where a cached byte store is actually slower than a word store?

People have published papers on how to do (more aggressive?) store coalescing in TSO memory models (like x86), e.g. Non-Speculative Store Coalescing in Total Store Order

Coalescing could allow a store-buffer entry to be freed before its data commits to L1d (presumably only after retirement), if its data is copied to a store to the same line. This could only happen if no stores to other lines separate them, or else it would cause stores to commit (become globally visible) out of program order, violating the memory model. But we think this can happen for any 2 stores to the same line, even the first and last byte.

A problem with this idea is that SB entry allocation is probably a ring buffer, like the ROB. Releasing entries out of order would mean hardware would need to scan every entry to find a free one, and if entries are reallocated out of order, they're no longer in program order relative to later stores. That could make allocation and store-forwarding much harder, so it's probably not plausible.

As discussed in Are two store buffer entries needed for split line/page stores on recent Intel?, it would make sense for an SB entry to hold all of one store even if it spans a cache-line boundary. Cache line boundaries become relevant when committing to L1d cache on leaving the SB. We know that store-forwarding can work for stores that split across a cache line. That seems unlikely if they were split into multiple SB entries in the store ports.
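
To pin down what a line-split store looks like (a small hedged sketch; the buffer and function names are mine):

```cpp
#include <cstdint>
#include <cstring>

alignas(64) static char line_pair[128];

// An 8-byte store at offset 60 covers bytes 60..67: it crosses the 64-byte
// boundary, so it presumably occupies one SB entry but commits to L1d as
// two halves.  The exact-match reload right after it can still be
// store-forwarded, consistent with the store living in a single entry.
uint64_t split_store_reload(uint64_t v) {
    std::memcpy(&line_pair[60], &v, 8);  // spans two cache lines
    uint64_t r;
    std::memcpy(&r, &line_pair[60], 8);  // exact overlap: forwarding can work
    return r;
}
```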


Terminology: I've been using "coalescing" to talk about merging in the store buffer, vs. "write combining" to talk about NT stores that combine in an LFB before (hopefully) doing a full-line write with no RFO. Or stores to WC memory regions which do the same thing.

This distinction / convention is just something I made up. According to discussion in comments, this might not be standard computer architecture terminology.

Intel's manuals (especially the optimization manual) are written over many years by different authors, and also aren't consistent in their terminology. Take most parts of the optimization manual with a grain of salt, especially if they talk about Pentium 4. The new sections about Sandybridge and Haswell are reliable, but older parts might have stale advice that's only / mostly relevant for P4 (e.g. inc vs. add 1), or the microarchitectural explanations for some optimization rules might be confusing / wrong. Especially section 3.6.10 Write Combining. The first bullet point about using LFBs to combine stores while waiting for lines to arrive for cache-miss stores to WB memory just doesn't seem plausible, because of memory-ordering rules. See discussion between me and BeeOnRope linked above, and in comments here.


Footnote 1:

A write-combining cache to buffer write-back (or write-through) from inner caches would have a different name. e.g. Bulldozer-family uses 16k write-through L1d caches, with a small 4k write-back buffer. (See Why do L1 and L2 Cache waste space saving the same data? for details and links to even more details. See Cache size estimation on your system? for a rewrite-an-array microbenchmark that slows down beyond 4k on a Bulldozer-family CPU.)

Footnote 2: Some POWER CPUs let other SMT threads snoop retired stores in the store buffer: this can cause different threads to disagree about the global order of stores from other threads. Will two atomic writes to different locations in different threads always be seen in the same order by other threads?

Footnote 3: non-x86 CPUs with weak memory models can commit retired stores in any order, allowing more aggressive coalescing of multiple stores to the same line, and making a cache-miss store not stall commit of other stores.

answered by Peter Cordes
  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/189096/discussion-on-answer-by-peter-cordes-size-of-store-buffers-on-intel-hardware-wh). –  Feb 26 '19 at 22:46
  • There is some good stuff in that comment thread, including some theorizing about why we get decreased performance when one loop has more than 4 output streams. (So loop fission is recommended there, according to the confusing and probably partly wrong section 3.6.10 Write Combining in Intel's optimization manual.) – Peter Cordes Feb 26 '19 at 23:00
  • The site ensures us mods don't sleep :p –  Feb 26 '19 at 23:07
  • Question about "Atomic RMW operations have to wait for the store buffer to drain before they can lock a cache line and do both their load ...". This would imply there is a [StoreLoad] in front of the atomic RMW operation? Is an atomic RMW not just an acquire load (+ locking of the cache line) and a sequentially consistent store (+ release of the lock on the cache line)? – pveentjer May 28 '20 at 12:35
  • @pveentjer: No, just StoreStore and LoadStore, as always. The barrier is tied to the atomic RMW which is both a load *and* a store, and both are tied to each other in the global order. (That's what makes it an atomic RMW, not just separate load + store like you get without the `lock` prefix). If the store appeared earlier, it wouldn't be a proper release operation for earlier loads + stores. – Peter Cordes May 28 '20 at 19:29
  • I think I understand the problem. The issue is my lack of understanding of a barrier being `tied` to an instruction. Do you know where I can find some more information about this behavior? – pveentjer May 29 '20 at 01:30
  • @pveentjer: that's terminology I made up. It follows automatically from the fact that an atomic RMW is both a load and a store, i.e. happens at a single point in the global order. The store can't become visible before earlier stores, therefore the store buffer has to drain before the atomic RMW modifies a cache line. And the load can't become visible after later memory operations so they have to wait until the RMW commits. Letting anything happen *between* the load and store parts of an atomic RMW would make it non-atomic, by the very definition of atomicity = indivisible. – Peter Cordes May 29 '20 at 01:34
  • Now I'm back at my original confusion "The store can't become visible before earlier stores, therefore the store buffer has to drain `before` the atomic RMW modifies a cache line." This `before` doesn't compute. The atomic operation gets translated to a series of uops; a load, modify and then a store. And the load has acquire semantics (no waiting for SB to be drained is needed) and the store is a sc-store. So waiting for the SB to be drained happens on the store, not the load. – pveentjer May 29 '20 at 01:59
  • @pveentjer: right, but the core can't let any other loads or stores commit to this or any other cache line between that load or store, or else it wouldn't be an *atomic* RMW. `add [rdi], eax` isn't atomic, `lock add [rdi], eax` is. [Can num++ be atomic for 'int num'?](https://stackoverflow.com/q/39393850). From the POV of anything that could observe the difference, the load and store are part of the same single access to the cache line, the atomic RMW. Creating that illusion requires the microcode to use special uops that aren't part of normal memory-destination instructions. – Peter Cordes May 29 '20 at 02:02
  • I'll check the link out. I already had a problem with my approach because the acquire load provides everything but a [StoreLoad], so a store would be able to jump in front of the load. X=r1[LL][LS][SS]r2=Y, now the X=r1 could jump in front of the r2=y; so inside the atomic instruction. – pveentjer May 29 '20 at 02:06
  • I now agree that acquire-load is no good after reading Intel docs. Atomic instructions are not reordered with loads/stores, in case of an acquire load, the load could still be reordered with an earlier store. Full barrier is needed in front of load instead of acquire load. My issue is now what happens to the write. This must still be a SC write and this requires a 'second' StoreLoad; it will be very cheap since there is 1 store in SB and the cache line is already in exclusive state; but a second StoreLoad is needed. Am I missing something? – pveentjer Jun 01 '20 at 05:34
  • @pveentjer: The store side of an atomic RMW does directly access L1d cache, not just getting put in the store buffer. Whether that's by entering the store buffer and draining it again or some other mechanism doesn't really matter, but I'd guess that a load port (which does have access to L1d directly, unlike store-address / store-data ports) runs a uop that does the actual update and unlock of the cache line. This of course happens after the store buffer has already been drained; as you say that has to happen before the load side of an atomic RMW. – Peter Cordes Jun 01 '20 at 05:39
  • @pveentjer: I think BeeOnRope found some evidence that atomic RMWs on current microarchitectures actually speculatively load early, and have the data ready to swap + verify atomicity/ordering once the store buffer is actually drained. – Peter Cordes Jun 01 '20 at 05:40
  • I guess the store buffer only contains stores to memory; stores to a register use a different mechanism? – pveentjer Dec 03 '21 at 10:12
  • @pveentjer: Register writes aren't even called stores in computer architecture terminology. And no, of course the store buffer doesn't hold them; they only ever need to become visible to other parts of the CPU core, not in memory. There's no cache coherency to deal with, and no need to make sure they're non-speculative. – Peter Cordes Dec 03 '21 at 10:21
  • Thanks. I wanted to confirm that the store buffer only has stores to memory and not registers. – pveentjer Dec 03 '21 at 10:24