Let's say we have this pseudocode, where `ptr` is not in any CPU cache:

```
prefetch_to_L1 ptr
/* ~20 cycles of unrelated work */
load ptr
```
Since `ptr` is in main memory, the latency of the prefetch operation (from the prefetch instruction decoding to `ptr` being available in the L1 cache) is much greater than 20 cycles. Will the latency of the load be reduced at all by the in-progress prefetch? Or is the prefetch useless unless it completes before the load?
Naively (without much understanding of how the memory system works) I could see it working two ways:
- When the CPU executes the load, it somehow identifies that a prefetch is in progress for the same address, and waits for the prefetch to complete before loading from L1.
- The CPU sees that the address is not currently in cache and goes to main memory, ignoring the prefetch operation executing in parallel.
Is one of these correct? Is there some third option I haven't thought of? I'm interested in Skylake in particular, but I'm also trying to build some general intuition.