
Let's say we have this pseudocode, where `ptr` is not in any CPU cache:

```
prefetch_to_L1 ptr
/* 20 cycles */
load ptr
```

Since `ptr` is in main memory, the latency of the prefetch operation (from the prefetch instruction decoding to `ptr`'s line being available in L1 cache) is much greater than 20 cycles. Will the latency of the load be reduced at all by the in-progress prefetch? Or is the prefetch useless unless it completes before the load?
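
For concreteness, on x86 the pattern would look something like this in C (a sketch; `_mm_prefetch` with `_MM_HINT_T0` is the SSE intrinsic that requests the line into all cache levels, including L1):

```c
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */

long read_after_prefetch(const long *ptr)
{
    /* Hint: bring ptr's cache line into L1d. The prefetch completes
       asynchronously, and from DRAM it takes far longer than 20 cycles. */
    _mm_prefetch((const char *)ptr, _MM_HINT_T0);

    /* ... roughly 20 cycles of unrelated work would go here ... */

    /* Demand load issued while the prefetch is still in flight. */
    return *ptr;
}
```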

Naively (without much understanding of how the memory system works), I could see it working in one of two ways:

  • When the CPU executes the load, it somehow identifies that a prefetch is in progress for the same address, and waits for the prefetch to complete before loading from L1.
  • The CPU sees that the address is not currently in cache and goes to main memory, ignoring the prefetch operation executing in parallel.

Is one of these correct? Is there some third option I haven't thought of? I'm interested in Skylake in particular, but also just trying to build some general intuition.

Elliot Gorokhovsky
  • You'd certainly think there would be some mechanism to identify a fetch already in progress, since it would also arise in out-of-order execution. Imagine `load ptr; do other stuff; load ptr again;`. While waiting for `load ptr` to complete, we might speculatively execute all the `other stuff` out of order, and then get to `load ptr again`. It would be silly to issue a second request for the same cache line, and especially silly to wait for it to be loaded *again* from L2 after the previous request gets it to L1. I have to believe the designers thought of that. – Nate Eldredge Feb 19 '22 at 17:02
  • It should be pretty easy to test if you wanted. – Nate Eldredge Feb 19 '22 at 17:03
  • @NateEldredge: Yeah, pretty sure a demand load for data that was already requested by SW or HW prefetch will identify the existing LFB or superqueue entry waiting for that data, and attach itself as another thing to be notified when it eventually arrives. A mechanism like that is needed to avoid duplicate requests for the same cache line when you get demand loads for `a[0]`, `a[1]`, etc. (IDK if there's any additional complication for promoting prefetches to demand-loads, like making sure outer levels didn't discard the request if their buffers were full.) – Peter Cordes Feb 19 '22 at 22:34
  • @NateEldredge: Skylake has a perf event for `load_hit_pre.sw_pf` - *Demand load dispatches that hit L1D fill buffer (FB) allocated for software prefetch*. – Peter Cordes Feb 19 '22 at 23:03
  • Related: [Does software prefetching allocate a Line Fill Buffer (LFB)?](https://stackoverflow.com/q/19472036) / [How to measure late prefetches and killed prefetches on Haswell micro-architecture?](https://stackoverflow.com/q/48008741). See also https://www.realworldtech.com/haswell-cpu/5/ re: line-fill buffers. Also [What Every Programmer Should Know About Memory?](https://stackoverflow.com/a/8126441) - SW prefetch is often not worth it for sequential access. – Peter Cordes Feb 19 '22 at 23:12
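
Following up on the comments: a minimal sketch of the experiment Nate suggests, using the `load_hit_pre.sw_pf` event Peter mentions (this assumes Linux `perf` on a Skylake CPU that exposes that event name; the array size and stride are arbitrary illustration choices):

```c
/* prefetch_then_load.c - software-prefetch each line and demand-load it
   immediately afterwards, so the loads should arrive while the prefetches
   are still in flight.
   Build: gcc -O2 prefetch_then_load.c -o prefetch_then_load */
#include <stdio.h>
#include <stdlib.h>
#include <xmmintrin.h>

enum { N = 1 << 24 };   /* 16M longs = 128 MiB, far larger than the LLC */
enum { LINE = 8 };      /* 8 longs * 8 bytes = one 64-byte cache line   */

int main(void)
{
    long *a = malloc((size_t)N * sizeof *a);
    if (!a) return 1;
    for (long i = 0; i < N; i++) a[i] = i;   /* fault the pages in */

    long sum = 0;
    for (long i = 0; i < N; i += LINE) {
        _mm_prefetch((const char *)&a[i], _MM_HINT_T0);
        sum += a[i];    /* demand load right behind the prefetch */
    }
    printf("%ld\n", sum);   /* keep the loads from being optimized away */
    return 0;
}
```

Running it under `perf stat -e load_hit_pre.sw_pf,mem_load_retired.l1_miss ./prefetch_then_load` should show a large `load_hit_pre.sw_pf` count if demand loads really do attach themselves to fill buffers already allocated by the prefetches, rather than issuing duplicate requests.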

0 Answers