
The documentation available here mentions that the data at the specified address is brought from memory into a cache line at the cache level given as a hint.

However, I am confused about whether the LLC is also checked (assuming the hint specifies L1D or L2), or whether memory is always accessed, irrespective of whether the data is already available in the LLC.

The reason I'm asking is that, in certain experiments of mine, using the _mm_prefetch intrinsic increased my LLC-loads count (perf event), even though I got an overall performance benefit.
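
For context, the pattern in those experiments looks roughly like this (the array, the prefetch distance, and the hint level here are illustrative placeholders, not my exact code):

```c
#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_*
#include <stddef.h>

#define DIST 128   // placeholder prefetch distance in elements, tuned experimentally

double sum_with_prefetch(const double *a, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        // L2 hint (T1) for the line I will need DIST iterations from now.
        // Prefetch never faults, so running slightly past the end is harmless.
        _mm_prefetch((const char *)&a[i + DIST], _MM_HINT_T1);
        sum += a[i];
    }
    return sum;
}

// LLC-loads were counted with something like:
//   perf stat -e LLC-loads,LLC-load-misses ./benchmark
```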

  • Any question related to the behavior of a processor implementation needs to include a clear identification of *which* processor is being discussed. (Even seemingly simple questions correspond to a large number of special cases on any particular implementation -- multiplying that complexity by an unspecified implementation makes it extremely hard to give a clear response.) – John D McCalpin Nov 26 '21 at 22:30

1 Answer


Prefetches can hit in LLC; it would be a pretty poor design if they cost extra DRAM traffic to get data into L1d if it was already hot in L2 or L3.

Also, the copy in L3 might be dirty, so the prefetch definitely has to check L3 anyway for correctness.

The only real design choice is whether data gets added to L3 when it wasn't already present. On Intel CPUs from Nehalem up to (but not including) Skylake-X, L3 is an inclusive cache, so there's no choice. (Difference between PREFETCH and PREFETCHNTA instructions)

On SKX and later, with the mesh interconnect between cores and a smaller non-inclusive L3, prefetchnta can avoid replacing a line in L3 if it wasn't already hot, but other prefetches will still populate the outer levels of cache like a demand load would (except stopping at whatever level of cache is specified in the prefetch hint).
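
For reference, the hint level just selects which prefetch instruction gets emitted; a minimal sketch of the four variants and what each nominally requests per Intel's documentation (actual placement is implementation-specific, as described above):

```c
#include <xmmintrin.h>

void prefetch_variants(const char *addr) {
    _mm_prefetch(addr, _MM_HINT_T0);   // prefetcht0:  into all cache levels (L1d and outer)
    _mm_prefetch(addr, _MM_HINT_T1);   // prefetcht1:  into L2 and outer levels, not L1d
    _mm_prefetch(addr, _MM_HINT_T2);   // prefetcht2:  into L3 and outer; Intel CPUs typically treat it like T1
    _mm_prefetch(addr, _MM_HINT_NTA);  // prefetchnta: non-temporal; on SKX and later can avoid populating L3
}
```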

Peter Cordes
  • One more doubt.. I was not able to find any references on what happens when the prefetch address is across the page boundary. Any ideas regarding this? – Harsh Kumar Nov 25 '21 at 16:19
  • @HarshKumar: That's impossible by design: prefetch takes an `m8` memory operand (https://www.felixcloutier.com/x86/prefetchh), and a single byte can't span a page boundary. – Peter Cordes Nov 25 '21 at 17:44
  • @PeterCordes No, I was asking if the prefetch address corresponds to another page - other than the one in which the currently processed data is available. So, if my computation is accessing A[15], but I'm prefetching A[1500], then A[1500] will be in another page. – Harsh Kumar Nov 25 '21 at 17:56
  • @HarshKumar: old CPUs used to ignore software prefetch hints on TLB miss, but that hasn't been the case for years. SW prefetch distance is a tricky thing to tune (too far and cache pollution may evict the data again before you get to it, especially for PREFETCHNTA, and bandwidth depends on the whole system not just the current process.) But SW prefetching, if it's useful at all for a sequential access pattern, should probably be something like 1 to 4kiB ahead of where you're reading / writing. – Peter Cordes Nov 25 '21 at 21:50
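
To make the "1 to 4kiB ahead" figure from the last comment concrete, here is a minimal sketch assuming 64-byte cache lines and a float array (the distance and the once-per-line prefetch cadence are illustrative and would need tuning with perf on the actual system):

```c
#include <xmmintrin.h>
#include <stddef.h>

// 2 KiB ahead = 2048 / 64 = 32 cache lines = 2048 / sizeof(float) = 512 elements.
enum {
    PF_BYTES_AHEAD  = 2048,
    PF_FLOATS_AHEAD = PF_BYTES_AHEAD / sizeof(float),   // 512
    FLOATS_PER_LINE = 64 / sizeof(float)                // 16
};

void scale(float *a, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (i % FLOATS_PER_LINE == 0)   // one prefetch per cache line, not per element
            _mm_prefetch((const char *)&a[i + PF_FLOATS_AHEAD], _MM_HINT_T0);
        a[i] *= 2.0f;                   // placeholder work on the current element
    }
}
```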