Can anyone explain what a load buffer is and how it differs from invalidation queues? And also the difference between store buffers and write-combining buffers? The paper by Paul E. McKenney http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.07.23a.pdf explains store buffers and invalidation queues very nicely, but unfortunately doesn't talk about write-combining buffers.
-
See also [Size of store buffers on Intel hardware? What exactly is a store buffer?](//stackoverflow.com/q/54876208) – Peter Cordes May 10 '19 at 03:48
-
Also [this more successful (IMO) attempt](https://stackoverflow.com/questions/64141366/can-a-speculatively-executed-cpu-branch-contain-opcodes-that-access-ram) at describing what a store buffer is and what it does, which I think is more beginner-friendly. – Peter Cordes Oct 12 '20 at 11:45
1 Answer
An invalidate queue is more like a store buffer, but it's part of the memory system, not the CPU. Basically it is a queue that keeps track of invalidations and ensures that they complete properly, so that a cache can take ownership of a cache line before it writes that line. A load queue is a speculative structure that keeps track of in-flight loads in an out-of-order processor. For example, the following can occur:
- The CPU speculatively issues a load from X.
- That load comes after a store to Y in program order, but the address of Y is not resolved yet, so the store does not proceed.
- Y is resolved and turns out to be equal to X. When the store to Y is resolved, it searches the load queue for speculative loads that have already issued but come after the store to Y in program order. It will notice the load from X (which equals Y) and has to squash that load and every instruction after it.
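A minimal C sketch of that scenario, with hypothetical names (`p` plays the role of Y and `x` the role of X); the point is simply that the hardware cannot know whether the two accesses alias until the store address is resolved:

```c
/* Hypothetical illustration of the load-queue scenario above.  Until p is
 * resolved, the hardware cannot tell whether the store *p = 42 (the "store
 * to Y") writes the same location as the later load *x (the "load from X").
 * An out-of-order core may issue the load early; if p turns out to equal x,
 * the load and everything after it must be squashed and replayed so the
 * load observes the value 42. */
int example(int *p, int *x)
{
    *p = 42;        /* store to Y: address unknown until p is resolved  */
    return *x;      /* load from X: may be issued speculatively, early  */
}
```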
A store buffer is a speculative structure that exists in the CPU, just like the load queue, and is there to let the CPU speculate on stores. A write-combining buffer is part of the memory system and essentially takes a bunch of small writes (think 8-byte writes) and packs them into a single larger transaction (a 64-byte cache line) before sending them to the memory system. These writes are not speculative and are part of the coherence protocol. The goal is to save bus bandwidth. Typically, a write-combining buffer is used for uncached writes to I/O devices (often graphics cards). It's typical to program a device's registers with a series of 8-byte writes, and the write-combining buffer allows those writes to be combined into larger transactions when shipping them out past the cache.
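As a hedged illustration (not something the answer itself spells out), x86 non-temporal stores are one way software ends up using the write-combining buffers; the sketch below assumes SSE2 on a 64-bit target, and `regs` is a hypothetical pointer to a device-register block mapped with the WC memory type:

```c
#include <immintrin.h>
#include <stdint.h>

/* Minimal sketch: eight 8-byte non-temporal stores to one 64-byte-aligned
 * block can be merged in a write-combining buffer and leave the core as a
 * single 64-byte transaction instead of eight small ones. */
static void program_device(volatile uint64_t *regs)
{
    for (int i = 0; i < 8; i++)
        _mm_stream_si64((long long *)&regs[i], (long long)(i + 1));
    _mm_sfence();   /* close/flush the WC buffer so the device sees the writes */
}
```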

-
I just noticed the question, and was going to answer it - heck, I invented Intel's write combining and load buffers, or at least my name is on many of the patents - but the answer above is perfectly fine. – Krazy Glew Jun 22 '12 at 18:41
-
Store buffers = not always speculative, not always inside CPU. There may be store buffers outside the CPU, e.g. between a write-through L1 and L2. – Krazy Glew Jun 22 '12 at 19:03
-
Load buffers (1) hold loads after load addresses have been calculated, but until the load is really ready to execute; or, after you tried to execute a load, but determined that there was a problem, like a cache miss or an earlier store to the same address that is not yet data ready. (2) can be used to verify that out-of-order loads are correctly speculated, as Martin describes. – Krazy Glew Jun 22 '12 at 19:06
-
There are other data structures - called fill buffers at Intel, MSHRs (Miss Status Handling Registers) elsewhere - that track cache misses at line granularity. When a cache miss is completed, typically one load is directly woken up, and the others are woken up out of the load buffer. – Krazy Glew Jun 22 '12 at 19:09
-
Many chips (although not Intel's flagship processors) have combined LSQs (load-store queues). – Krazy Glew Jun 22 '12 at 19:09
-
I prefer the term "snoop queue" or "probe queue" to McKenney's "invalidation queue", because the latter refers to only one particular type, albeit the most common. E.g. it doesn't apply to update protocols. Anyway, Nathan is right, in that invalidations or snoops or probes most importantly reflect stores done by other processors in the system that your processor needs to see. – Krazy Glew Jun 22 '12 at 19:11
-
Write-combining buffers are used to combine multiple small writes into bigger writes. Intel's WC buffers are very similar to fill buffers, are cache-line sized, etc. They are buffers because they may be filled out of order, at least for the WC memory type. (They may be filled strictly in order for other memory types.) – Krazy Glew Jun 22 '12 at 19:13
-
@KrazyGlew Does the CPU really have a component called an invalidate queue? Why can't I find a description of it in any processor textbook? – haolee Apr 07 '22 at 11:09
-
@haolee: yes, many CPUs have had something called an "invalidation Q" or "invalidate Q" or "probe Q" or something similar, to buffer snoop/invalidate/probe requests from the bus before they get sent to the cache tags to perform their action. IIRC on at least one system this was subsumed into an L2$ super-Q, which performed this function as well as other related functions such as scheduling writebacks, etc. Why not in textbooks? Typically because textbooks were written 10 or 20 years after CPUs like this were designed, usually by academics with little experience who reinvent and rename things. – Krazy Glew Apr 07 '22 at 14:57
-
@KrazyGlew Really helpful. I have another question about the invalidate queue. If a core loads a value before completing its pending invalidate requests, it will load a stale value. Is this problem called "LoadLoad" reordering? If so, why is it called "reordering", given that it just loads a stale value and is not related to the instruction execution order? – haolee Apr 07 '22 at 17:54
-
@haolee: things like LoadLoad or RAR re-ordering are used even for in-order architectures, because they refer to the reordering of requests in a hypothetical global order for the interleaving of requests from different processors. (Or even the same processor.) Essentially, the moment you have buffering or queuing in a system, things get re-ordered because of different delays and interlocks or lack thereof. IIRC the first good paper I saw on this was by Christoph Scheurich and was titled something about buffering. Long before out-of-order CPU architectures were a common thing. – Krazy Glew Apr 07 '22 at 18:39
-
@KrazyGlew I'm also curious about the relationship between memory ordering and out-of-order execution on relaxed-consistency architectures (DEC Alpha/RISC-V/ARM). Let's assume the program order is "load a; load b". If the invalidation queue didn't exist and core0 executed the instructions out of order (i.e. load b; load a), would this misordering be visible to core1? In other words, is memory reordering caused by out-of-order execution as well as by the store buffer and the invalidation queue? – haolee Apr 08 '22 at 02:49
-
@KrazyGlew Does a memory barrier have two functions, one forbidding out-of-order execution, the other flushing the invalidation queue and draining the store buffer? – haolee Apr 08 '22 at 02:50
-
@haolee: This topic is too large to fit in the margins of these comments on this answer in Stack Exchange – Krazy Glew Apr 08 '22 at 02:51
-
@KrazyGlew I agree. It's a pity. I have read many SO questions and answers but can't find an accurate explanation from a hardware perspective. I also have a dedicated SO question. If you have time, this would be a better place to clarify all these things. – haolee Apr 08 '22 at 03:24
-
@KrazyGlew sorry, forgot to paste the url in last comment. https://stackoverflow.com/q/71768672/4112667 – haolee Apr 08 '22 at 05:23
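To make the LoadLoad-reordering and barrier discussion in the comments above concrete, here is a minimal message-passing litmus test in C11. It is a sketch, not anything posted in the thread; the acquire/release choices stand in for whatever fences or barrier instructions a given architecture would actually use.

```c
#include <stdatomic.h>

/* Message-passing litmus test.  With relaxed ordering on the reader side,
 * the reader's two loads may effectively be observed out of order (e.g.
 * because of an invalidation queue or out-of-order execution), so it could
 * see flag == 1 yet still read the stale data == 0.  The acquire/release
 * pair is what forbids that outcome. */
static atomic_int data = 0;
static atomic_int flag = 0;

void writer(void)                              /* runs on core 0 */
{
    atomic_store_explicit(&data, 42, memory_order_relaxed);
    atomic_store_explicit(&flag, 1, memory_order_release);
}

int reader(void)                               /* runs on core 1 */
{
    while (!atomic_load_explicit(&flag, memory_order_acquire))
        ;                                      /* wait for the flag */
    return atomic_load_explicit(&data, memory_order_relaxed);  /* 42 guaranteed */
}
```

On a strongly ordered machine the fully relaxed version may happen to work, but on Alpha/ARM/RISC-V-style relaxed models the acquire/release pair (or explicit barrier instructions) is what prevents the reader from observing the flag while still reading stale data.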