
Let's say we have a processor with two cores (C0 and C1) and a cache line starting at address k that is owned by C0 initially. If C1 issues a store instruction to an 8-byte slot in line k, will that affect the throughput of the subsequent instructions executing on C1?

The Intel optimization manual contains the following paragraph:

When an instruction writes data to a memory location [...], the processor ensures that the line containing this memory location is in its L1d cache [...]. If the cache line is not there, it fetches from the next levels using an RFO request [...] RFO and storing the data happen after instruction retirement. Therefore, the store latency usually does not affect the store instruction itself.

With reference to the following code,

// core c0
foo();
line(k)->at(i)->store(kConstant, std::memory_order_release);
bar();
baz();

The quote from the Intel manual makes me assume that in the code above, execution will proceed as if the store were essentially a no-op, and that it would not impact the latency between the end of foo() and the start of bar(). In contrast, for the following code,

// core c0
foo();
bar(line(k)->at(i)->load(std::memory_order_acquire));
baz();

The latency between the end of foo() and the start of bar() would be impacted by the load, since bar() takes the result of the load as a dependency.


This question is mostly concerned with how Intel processors (Broadwell family or newer) behave in the case above, and in particular with how C++ code like the above gets compiled down to assembly for those processors.

  • You can use https://godbolt.org/ to see compiler-generated asm easily; see [How to remove "noise" from GCC/clang assembly output?](https://stackoverflow.com/q/38552116) for tips on writing C examples that compile to interesting asm. – Peter Cordes Jun 17 '20 at 00:08
  • *The latency between the end of foo() and the start of bar() would be impacted by the load*. There's no data dependency between `foo()` and `k.load()`, so latency doesn't apply. Out-of-order exec can potentially get started on that load while `foo()` is still executing. But yes the load itself will be high latency from execution to result arriving, so ideally it can execute and start that process as early as possible. – Peter Cordes Jun 17 '20 at 00:17
  • But yes, for your first example, the store buffer decouples store misses from execution. This is one of the major reasons for having a store buffer, the other being to keep speculative execution of stores private to this core. See also [Size of store buffers on Intel hardware? What exactly is a store buffer?](https://stackoverflow.com/q/54876208). – Peter Cordes Jun 17 '20 at 00:18
  • @PeterCordes :) I asked that question as well! Looks like I might not have fully understood the concept... – Curious Jun 17 '20 at 01:38
  • My answer there got kind of bogged down in some technical details and isn't the best summary of the high-level key points of what a store buffer is. That's why I later added some other links at the top. Ideally I'd rewrite parts of it but I tend to get bored part way through large edits and never finish. :/ – Peter Cordes Jun 17 '20 at 02:21
  • Perhaps also worth mentioning that a load on the same core can store-forward a store from the store buffer before it commits, even while waiting for a store miss. (Unless you use a `seq_cst` store; that forces draining the store buffer before later loads, exactly to prevent this.) – Peter Cordes Jun 17 '20 at 04:03
  • @PeterCordes I didn't follow that last scenario. Could you give an example to illustrate maybe? – Curious Jun 17 '20 at 04:41
  • `push 1` / `pop rax` can execute the load and let later instructions read RAX, even if an earlier cache-miss store is still waiting for an RFO. Also https://preshing.com/20120515/memory-reordering-caught-in-the-act/ (adding mfence makes it a seq_cst store). Also [Can modern x86 implementations store-forward from more than one prior store?](https://stackoverflow.com/q/46135766) re: store-forwarding in general. [Can x86 reorder a narrow store with a wider load that fully contains it?](https://stackoverflow.com/q/35830641) is another example of code broken by store-forwarding. – Peter Cordes Jun 17 '20 at 04:59

1 Answer


Generally speaking, a store that is not soon read by subsequent code doesn't directly delay that code on any modern out-of-order processor, including Intel's.

For example:

foo();
*x = y;
bar();

If foo() doesn't modify x or y, and bar() doesn't load from *x, the store is independent: it may start executing before foo() completes (or even before it starts), bar() may execute before the store commits to the cache, and bar() may even run while foo() is still running.

While there is little direct impact, that doesn't mean there are no indirect impacts; indeed, the store may come to dominate the execution time.

If the store misses in cache, it may tie up off-core resources while the miss is satisfied. It also usually prevents subsequent stores from draining, which may become a bottleneck: if the store buffer fills up, the front-end stalls entirely and new instructions no longer enter the scheduler.

Finally, everything depends on the details of the surrounding code, as usual. If that sequence is run repeatedly, and foo() and bar() are short, the misses related to the store may dominate the runtime. After all, buffering can't hide the cost of an unlimited number of stores. At some point you'll be bound by the intrinsic throughput of the stores.

  • I see. If we are guaranteed that the memory address being stored on is not cached in the core that is executing the store instruction, then is the store buffer the only resource that limits the throughput of the following code? – Curious Jun 17 '20 at 01:00
  • And related - what typically is the size of such a store buffer on intel hardware? (a rough range is good!) – Curious Jun 17 '20 at 01:00
  • @Curious I curate a table of this and other resource sizes in [this blog post](https://travisdowns.github.io/blog/2019/06/11/speed-limits.html#ooo-table). Store buffer sizes have varied from 36 on Sandy Bridge, to 72 on Ice Lake. – BeeOnRope Jun 17 '20 at 01:02
  • @Curious - about your other question, I wouldn't really say that. It is "complicated". For example, a store miss uses up a fill buffer that is not available to subsequent code. The store itself could cause page walks if the address misses in the TLB, and so not retire until a while later, which could block subsequent code. Similarly if the data for the store is not available. – BeeOnRope Jun 17 '20 at 01:06
  • @Curious - the units for store buffer is "entries" aka individual stores. E.g., if a store buffer has 36 entries, it can hold 36 stores, regardless of whether they are to the same cache line or not. Every store uop requires one store buffer entry. – BeeOnRope Jun 17 '20 at 01:07
  • You might want to expand on the point that the store buffer not draining can lead to it filling up and blocking *allocation* of later store instructions, i.e. stalling the front-end. – Peter Cordes Jun 17 '20 at 01:18
  • @PeterCordes added a bit – BeeOnRope Jun 17 '20 at 01:26
  • Also - does the answer apply if the core owning the cache line is another NUMA node entirely? (I am guessing it does, but want to clarify in case there are some subtleties here) – Curious Jun 17 '20 at 01:33
  • @Curious - yes, it doesn't matter. The core can't know that anyway, so the store proceeds in the same way regardless of where the line is. When it gets to the head of the store buffer, "miss processing" will start (more or less) and if it's in another NUMA node that might just take longer than usual, but there are no fundamental differences at the core level. – BeeOnRope Jun 17 '20 at 01:34
  • Note that the fact that the store miss blocks subsequent stores from committing means it is highly likely to be a problem for a long miss. If a miss takes 100 ns, that's 400 cycles on a 4 GHz cpu, which is 800 instructions with an IPC of 2. If those 800 instructions have more than "store buffer size" stores, you'll stall. It would not be uncommon for 800 instructions to have 50-100 stores or more. @Curious – BeeOnRope Jun 17 '20 at 01:36
  • That's also considering the best case where the store buffer was initially empty so that the store in question gets to start processing right away. – BeeOnRope Jun 17 '20 at 01:37