
We know that dirty victim data is not immediately written back to RAM; it is stashed away in the store buffer and written back to RAM later as time permits. There is also the store-forwarding technique: if you do a subsequent LOAD to the same location on the same core before the value is flushed to the cache/memory, the value from the store buffer will be "forwarded" and you will get the value that was just stored. This can be done in parallel with the cache access, so it doesn't slow things down.
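For example, a tiny sketch I put together (function and variable names are just for illustration, not from any manual):

```c
#include <stdint.h>

/* Illustration only: the store to *p sits in the store buffer; the load on
 * the next line reads the same address, so the core can forward the value
 * straight from the store buffer instead of waiting for the store to commit
 * to L1d.  (A compiler would likely optimize the reload away; think of this
 * as what the hardware does if the load actually executes.) */
uint64_t store_then_reload(uint64_t *p, uint64_t v)
{
    *p = v;     /* store: allocated into the store buffer        */
    return *p;  /* load to the same address: store-forwarded hit */
}
```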

My question is: with the help of the store buffer and store forwarding, store misses don't necessarily require the processor (the corresponding core) to stall. Therefore, store misses do not contribute to the total cache-miss latency, right?

Thanks.

dalglish

1 Answer


DRAM latency is really high, so it's easy for the store buffer to fill up and stall allocation of new store instructions into the back-end once a cache-miss store blocks its progress. The ability of the store buffer to decouple / insulate execution from cache misses is limited by its finite size. It always helps some, though. You're right, stores are easier to hide cache-miss latency for.
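As a rough illustration (buffer size, stride, and the 64-byte line size here are my assumptions for the sketch, not measurements), a store-only loop over a buffer much larger than last-level cache misses on roughly every line it touches:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of a store-miss stream: one store per 64-byte line across a buffer
 * much larger than LLC, so most stores miss.  After a few dozen in-flight
 * misses the finite store buffer is full and allocation of new stores into
 * the back-end stalls until the oldest miss completes its RFO and commits.
 * (Hardware prefetch hides some of this for a simple sequential stride; a
 * scattered access pattern would show the stall more clearly.) */
void store_miss_stream(uint8_t *buf, size_t bytes)
{
    for (size_t i = 0; i < bytes; i += 64)
        buf[i] = 1;   /* one store per cache line */
}
```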

Stalling and filling up the store buffer is more of a problem with a strongly ordered memory model like x86's TSO: stores can only commit from the store buffer into L1d cache in program order, so any cache-miss store blocks store-buffer progress until the RFO (Read For Ownership) completes. Initiating the RFO early (before the store reaches the commit end of the store buffer, e.g. upon retire) can hide some of this latency by getting the RFO in flight before the data needs to arrive.
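A tiny sketch of that constraint (the "cold" / "hot" cache states are assumptions I'm making for the example):

```c
/* Under x86-TSO the second store cannot commit to L1d before the first,
 * even though it would hit, because stores must become globally visible in
 * program order. */
int cold_var;  /* assume: not in any cache right now -> the store misses, needs an RFO */
int hot_var;   /* assume: already in L1d in Modified/Exclusive state                   */

void tso_commit_order(void)
{
    cold_var = 1;  /* store #1: waits for the RFO at the head of the store buffer */
    hot_var  = 2;  /* store #2: L1d hit, but must wait behind store #1 to commit  */
}
```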

Consecutive stores into the same cache line can be coalesced into a buffer that lets them all commit at once when the data arrives from RAM (or from another core which had ownership). There's some evidence that Intel CPUs actually do this, in the limited cases where that wouldn't violate the memory-ordering rules.
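As a sketch of the pattern such coalescing would target (my example; whether any given CPU actually merges these is exactly the part that's only partially documented):

```c
#include <stdint.h>

/* Eight consecutive 8-byte stores cover one 64-byte line.  Rather than eight
 * separate commits each waiting on the same in-flight line, the stores could
 * merge in a line-fill / write-combining buffer and commit together once the
 * line arrives, as long as that merge isn't observable as a memory-ordering
 * violation. */
void fill_line(uint64_t *line)   /* assume 'line' is 64-byte aligned */
{
    for (int i = 0; i < 8; i++)
        line[i] = 0;
}
```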

Peter Cordes
  • Might be worth adding that if there are loads after the store, and they are predicted to serialize with it (either by the memory-disambiguation predictor or just 4k aliasing), execution can stall before the store buffer fills. – Noah Jun 15 '21 at 14:15
  • A follow-up question: if a dirty block is evicted from the L1 cache, is it written directly to main memory? Or is it written to the L2 cache first, and then written back to L3 when it is evicted from L2? Are there any documents that show how the write-back policy works in a multi-level cache in real hardware? Thank you very much. @Peter – dalglish Jul 13 '21 at 04:29
  • @dalglish: It's evicted outward to the next level of cache, not all the way to main mem. L2 and L3 are of course also write-back caches, otherwise performance would take a **huge** nosedive for some workloads when your array got just barely too big for L1d cache. You can look at memset / memcpy vs. size benchmarks to see the actual effect. – Peter Cordes Jul 13 '21 at 04:59
  • @PeterCordes Thanks. Do you have any documentation you could suggest on how the write-back strategy is implemented in a real machine? – dalglish Jul 14 '21 at 05:41
  • @dalglish: You mean whether real CPUs try to clean some dirty lines when bus traffic is low, so future evictions can be cheaper? Not sure if Intel's (or AMD's) optimization manual mentions anything about that. If you mean replacement policy for outer caches, see https://blog.stuffedcow.net/2013/01/ivb-cache-replacement/ - IvB and later use adaptive replacement instead of just pseudo-LRU, to try to resist pollution from accesses with poor temporal locality. – Peter Cordes Jul 14 '21 at 05:44
  • @Peter Thanks. One more question: if we consider the accesses between L1d and L2, does the penalty for a load miss in the L1 data cache increase when the write buffer is writing to L2? I mean, if the write buffer is writing to L2, does a load access to L2 need to be stalled? I remember the path between L1d and L2 is only 64 bytes wide. – dalglish Jul 14 '21 at 23:07
  • @dalglish: I don't know the details there. I'd guess there'd be some buffering so L1 can ask for a new line while the outgoing line is in flight. Perhaps just via the LFBs. So if there's no free LFB after starting the eviction, then the load miss might have to wait for that if pseudo-LRU picks a dirty line to evict in that set? Unless the same LFB can be used to both send a line to L2 *and* request one back, as part of the same transaction? I'm mostly guessing here about how it *might* work, but these are plausible given other things we know about Intel CPUs. – Peter Cordes Jul 14 '21 at 23:18
  • This would be pretty hard to microbenchmark because there are 12 LFBs in a Skylake CPU, so it would be hard to know you've created a situation with them all in use with dirty vs. clean evictions. – Peter Cordes Jul 14 '21 at 23:18