We can always move data into the cache, if active, by simply performing a memory access.
We can prefetch a variable simply by "touching" it ahead of time; we don't need a special instruction for that.
It's unclear what you mean by "control over the cache" as we can enable/disable it, set its mode, its fill/spill policy and sharing mode with other HW threads.
We can also fill the cache with data and by clever use of arithmetic force the eviction of a line.
Your assumption that programmers have no control whatsoever over the cache is thus not entirely valid, though not entirely wrong either: the CPU is free to implement any cache policy it wants as long as it respects the documented specification (including having no cache at all, or spilling the cache every X clock ticks).
One thing we cannot do, yet, is pin lines in the cache: we cannot tell the CPU never to evict a specific line.
EDIT: As @Mysticial pointed out in the comments, it is possible to pin data in the L3 cache on newer Intel CPUs.
The PREFETCHT0, PREFETCHT1, PREFETCHT2, PREFETCHNTA and PREFETCHWT1 instructions, to which _mm_prefetch compiles, are just a hint for the hardware prefetchers, if present, active, and willing to respect the hint1.
Their limited use cases3 stem more from the finer control over the cache-hierarchy level at which the data will stop and from the reduced use of core resources2 than from any ability to move data into the cache.
Once a line has been prefetched, it is evicted from the cache just as any other line would be.
1 These hardware prefetchers are usually triggered by memory access patterns (like sequential accesses) and are asynchronous with respect to the execution flow.
2 They are asynchronous by nature (they quickly complete locally) and need not tie up the core resources a load would (e.g. a register, a load-unit entry, and so on).
3 While one may think that a hint is at worst useless (if not respected), it can actually turn out that prefetching degrades performance.