
On modern x86 CPUs, hardware prefetching is an important technique for bringing cache lines into various levels of the cache hierarchy before they are explicitly requested by user code.

The basic idea is that when the processor detects a series of accesses to sequential or strided-sequential¹ locations, it will go ahead and fetch further memory locations in the sequence, even before executing the instructions that (may) actually access those locations.

My question is whether the detection of a prefetch sequence is based on the full addresses (the actual addresses requested by user code) or on the cache line addresses, which are essentially the full addresses with the bottom 6 bits² stripped off.

For example, on a system with a 64-byte cache line, accesses to full addresses 1, 2, 3, 65, 150 would access cache lines 0, 0, 0, 1, 2.
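For concreteness, a minimal sketch of that mapping (my own illustration, assuming 64-byte lines):

```c
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/* Map each full byte address to its cache line address by
   dropping the bottom 6 bits (64-byte lines). */
int main(void) {
    uint64_t addrs[] = {1, 2, 3, 65, 150};
    for (int i = 0; i < 5; i++)
        printf("full address %3" PRIu64 " -> cache line %" PRIu64 "\n",
               addrs[i], addrs[i] >> 6);
    return 0;
}
```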

The difference could be relevant when a series of accesses is more regular at the cache line level than at the full address level. For example, a series of full addresses like:

32, 24, 8, 0, 64 + 32, 64 + 24, 64 + 8, 64 + 0, ..., N*64 + 32, N*64 + 24, N*64 + 8, N*64 + 0

might not look like a strided sequence at the full address level (indeed, it might incorrectly trigger the backwards prefetcher, since each subsequence of 4 accesses looks like a short reverse-strided sequence), but at the cache line level it looks like it's going forward one cache line at a time (just like the simple ascending sequence 0, 8, 16, 24, ...).
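To see both views of this pattern side by side, here is a small sketch (my own, assuming 64-byte lines) that prints the byte-level delta and the line index for each access; the deltas are mostly negative while the line index advances by exactly one per group of four:

```c
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void) {
    const uint64_t offsets[] = {32, 24, 8, 0};  /* within-line pattern */
    uint64_t prev = 0;
    for (uint64_t n = 0; n < 4; n++) {          /* first four groups */
        for (int i = 0; i < 4; i++) {
            uint64_t addr = n * 64 + offsets[i];
            printf("addr %4" PRIu64 "  delta %+4" PRId64 "  line %" PRIu64 "\n",
                   addr, (int64_t)(addr - prev), addr >> 6);
            prev = addr;
        }
    }
    return 0;
}
```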

Which system, if either, is in place on modern hardware?


Note: One could also imagine that the answer wouldn't be based on every access, but only on accesses that miss in some level of the cache the prefetcher is observing; in that case, the same question still applies to the filtered stream of "miss accesses".


¹ Strided-sequential just means accesses that have the same stride (delta) between them, even if that delta isn't 1. For example, a series of accesses to locations 100, 200, 300, ... could be detected as strided access with a stride of 100, and in principle the CPU will fetch based on this pattern (which would mean that some cache lines might be "skipped" in the prefetch pattern).

² Here assuming a 64-byte cache line.

BeeOnRope
  • I'm not sure, but based on the graph in the Intel Optimization Manual, section 7.5.3, the HW prefetcher's ability to hide a cache-miss latency depends on the stride in bytes (i.e. full addresses). If it used cache line addresses, I guess we would see flat lines within segments of 64 bytes. Not sure, though. – Margaret Bloom Dec 09 '17 at 21:55
  • According to Intel's optimization manual (section 2.3.5.4 about SnB), the streamer (in L2) only looks at patterns of lines requested by L1D / L1I. But it's not clear what the wording means for the L1D prefetcher. I *think* I remember reading that a sequence of loads within one cache line can trigger prefetch of the next, which is one of the possible interpretations of the description of the DCU streaming prefetcher as *"is triggered by an ascending access to very recently loaded data"*. But the IP-based prefetcher can still detect 3 steps forward / 2 steps back on a per-insn basis. – Peter Cordes Dec 10 '17 at 01:31
  • Tangentially related: [the L2 stream prefetcher seems to be triggered by accesses, not by misses](https://groups.google.com/d/msg/comp.arch/71wnqr_F9sw/bIgAVl04BgAJ), which is also a result I've seen lately in my testing. – BeeOnRope Dec 21 '17 at 20:45
  • @PeterCordes re: "But the IP-based prefetcher can still detect 3 steps forward / 2 steps back on a per-insn basis": what do you mean by that? Do you mean it's like the branch predictor in that it has a history? Or that it will detect order with sub-cache-line precision? It would make sense for the IP prefetcher to be the only one affected, since based on the explanation [here](https://stackoverflow.com/questions/20544917/prefetching-data-at-l1-and-l2) it's the only one that can detect strides. Also, I think that the prefetchers (or some of them at least) don't take the full address but only the page offset. – Noah Apr 16 '21 at 18:58

1 Answer


The cache line offsets can be useful, but they can also be misleading, as your example shows. I will discuss how the line offsets impact the data prefetchers on modern Intel processors, based on my experiments on Haswell.

The method I followed is simple. First, I disable all the data prefetchers except the one I want to test. Second, I design a sequence of accesses that exhibits a particular pattern of interest; the target prefetcher will see this sequence and learn from it. Then I follow that with an access to a particular line and accurately measure its latency to determine whether the prefetcher has prefetched that line. The loop doesn't contain any other loads; it does, however, contain one store used to record the latency measurement in a buffer.
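A minimal sketch of what one such trial might look like (all names here are mine, not the code actually used; it assumes the prefetchers under test have already been toggled, e.g. via MSR 0x1A4 as sketched below):

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtscp, _mm_lfence */

/* Time a single probe load with RDTSCP. A latency near the L1 hit
   time suggests the prefetcher already pulled the line in; a latency
   near the memory round-trip time suggests it did not. */
static inline uint64_t time_load(volatile uint8_t *p) {
    unsigned aux;
    _mm_lfence();                 /* keep earlier work out of the window */
    uint64_t t0 = __rdtscp(&aux);
    (void)*p;                     /* the probe load */
    uint64_t t1 = __rdtscp(&aux);
    _mm_lfence();
    return t1 - t0;
}

/* One trial: replay the training pattern (the loop's only loads),
   then probe one line; the single store records the measurement. */
void trial(volatile uint8_t *buf, const uint64_t *pattern, int n,
           uint64_t probe_off, uint64_t *out) {
    for (int i = 0; i < n; i++)
        (void)buf[pattern[i]];    /* training accesses */
    *out = time_load(&buf[probe_off]);
}
```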

There are 4 hardware data prefetchers. The behaviors of the DCU prefetcher and the L2 adjacent line prefetcher are not affected by the pattern of the line offsets, only by the pattern of the 64-byte-aligned line addresses.
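For reference, these four prefetchers can be toggled individually on many Intel processors through bits 0-3 of MSR 0x1A4, per Intel's hardware prefetcher control disclosure; below is a hedged Linux sketch using the msr driver (it assumes root, a loaded msr kernel module, and that this bit layout applies to the part in question):

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Set the prefetcher-disable bits of MSR 0x1A4 on CPU 0 (bit = 1 disables):
   bit 0: L2 streaming prefetcher    bit 1: L2 adjacent line prefetcher
   bit 2: DCU streaming prefetcher   bit 3: DCU IP prefetcher */
int set_prefetcher_disable_mask(uint64_t mask) {
    int fd = open("/dev/cpu/0/msr", O_RDWR);
    if (fd < 0)
        return -1;
    uint64_t val;
    int ok = pread(fd, &val, 8, 0x1A4) == 8;
    if (ok) {
        val = (val & ~0xFULL) | (mask & 0xF);
        ok = pwrite(fd, &val, 8, 0x1A4) == 8;
    }
    close(fd);
    return ok ? 0 : -1;
}
```

For example, a mask of 0x7 disables the first three and leaves only the DCU IP prefetcher enabled.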

My experiments don't show any evidence that the L2 streaming prefetcher even receives the cache line offset; it seems to get only the line-aligned address. For example, when accessing the same line multiple times, the offset pattern by itself does not seem to have any impact on the behavior of the prefetcher.

The DCU IP prefetcher shows interesting behavior. I've tested two cases (a sketch of both access patterns follows the list):

  • If a load instruction's accesses have decreasing offsets, the prefetcher will prefetch one or more lines in both the forward and backward directions.
  • If a load instruction's accesses have increasing offsets, the prefetcher will prefetch one or more lines, but only in the forward direction.
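A hedged sketch of the two training loops (my own reconstruction, not the author's code; each loop body is intended to compile to a single load instruction whose offsets decrease or increase across iterations, after which one would probe the neighboring lines with `time_load` from the earlier sketch):

```c
#include <stdint.h>

/* Case 1: one load instruction whose offsets decrease (56, 48, ..., 0). */
void train_decreasing(volatile uint8_t *buf) {
    for (int64_t off = 56; off >= 0; off -= 8)
        (void)buf[off];
}

/* Case 2: one load instruction whose offsets increase (0, 8, ..., 56). */
void train_increasing(volatile uint8_t *buf) {
    for (int64_t off = 0; off <= 56; off += 8)
        (void)buf[off];
}
```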
Hadi Brais