What are _mm_prefetch() locality hints?

Question

The intrinsics guide says only this much about void _mm_prefetch (char const* p, int i) :

Fetch the line of data from memory that contains address p to a location in the cache hierarchy specified by the locality hint i.

Could you list the possible values for int i parameter and explain their meanings?

I've found _MM_HINT_T0, _MM_HINT_T1, _MM_HINT_T2, _MM_HINT_NTA and _MM_HINT_ENTA, but I don't know whether this is an exhaustive list and what they mean.

If this is processor-specific, I would like to know what they do on Ryzen and the latest Intel Core processors.

score 44 · Accepted Answer · edited Oct 25 '21 at 18:13

Sometimes intrinsics are better understood in terms of the instruction they represent rather than the abstract semantics given in their descriptions.

The full set of locality constants, as of today, is

#define _MM_HINT_T0 1
#define _MM_HINT_T1 2
#define _MM_HINT_T2 3
#define _MM_HINT_NTA 0
#define _MM_HINT_ENTA 4
#define _MM_HINT_ET0 5
#define _MM_HINT_ET1 6
#define _MM_HINT_ET2 7

as described in this paper about Intel Xeon Phi coprocessor prefetching capabilities.

For IA32/AMD processors, the set is reduced to

#define _MM_HINT_T0 1
#define _MM_HINT_T1 2
#define _MM_HINT_T2 3
#define _MM_HINT_NTA 0
#define _MM_HINT_ET1 6

_mm_prefetch is compiled into different instructions depending on the architecture and the locality hint:

    Hint              IA32/AMD          iMC
_MM_HINT_T0           prefetcht0     vprefetch0
_MM_HINT_T1           prefetcht1     vprefetch1
_MM_HINT_T2           prefetcht2     vprefetch2
_MM_HINT_NTA          prefetchnta    vprefetchnta
_MM_HINT_ENTA              -         vprefetchenta
_MM_HINT_ET0               -         vprefetche0
_MM_HINT_ET1          prefetchwt1    vprefetche1
_MM_HINT_ET2               -         vprefetche2

What the (v)prefetch instructions do, if all the requirements are satisfied, is bring a cache line's worth of data into the cache level specified by the locality hint.
The instruction is only a hint; the processor may ignore it.
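
As a rough illustration of how such a hint is typically issued, here is a minimal sketch of a read loop that prefetches ahead of itself with _MM_HINT_T0. The function name, the prefetch distance and the 64-byte line size are assumptions made for the example, not values taken from any manual; a useful distance has to be found by measurement.

```c
#include <stddef.h>
#include <xmmintrin.h>   /* _mm_prefetch and the _MM_HINT_* constants */

/* Assumption: 64-byte cache lines, i.e. 16 floats per line. */
#define FLOATS_PER_LINE 16
/* Hypothetical tuning knob: how many elements ahead to prefetch. */
#define PREFETCH_AHEAD 64

/* Sums n floats. Once per assumed cache line, it hints that the data
 * PREFETCH_AHEAD elements further on will be read soon. The result is
 * the same whether or not the CPU honours the hint. */
float sum_with_prefetch(const float *a, size_t n)
{
    float total = 0.0f;
    for (size_t i = 0; i < n; i++) {
        if (i % FLOATS_PER_LINE == 0 && i + PREFETCH_AHEAD < n)
            _mm_prefetch((const char *)&a[i + PREFETCH_AHEAD], _MM_HINT_T0);
        total += a[i];
    }
    return total;
}
```

The hint argument has to be a compile-time constant, because it selects which instruction is emitted. For a simple sequential scan like this one the hardware prefetchers usually do well on their own, so the loop only shows the mechanics.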

When a line is prefetched into level X, the manuals (both Intel and AMD) say that it is also fetched into all the higher levels (except when X=3).
I'm not sure this is actually true. I believe the line is prefetched with respect to cache level X and, depending on the caching strategy of the higher levels (inclusive vs. non-inclusive), it may or may not be present there too.

Another attribute of the (v)prefetch instructions is the non-temporal attribute.
Non-temporal data is data that is unlikely to be reused soon.
In my understanding, NT data is stored in the "streaming load buffers" on the IA32 architecture¹, while on the iMC architecture it is stored in the normal cache (using the hardware thread id to select the way) but with a Most Recently Used replacement policy (so that it will be the next line evicted if needed).
For AMD, the manual says that the actual location is implementation-dependent, ranging from a software-invisible buffer to a dedicated non-temporal cache.
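
A minimal sketch of where the non-temporal hint fits: a buffer that is scanned exactly once, so keeping its lines around would only displace more useful data. The function name, the 64-byte line size and the prefetch distance below are assumptions for illustration only.

```c
#include <stddef.h>
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_NTA */

#define CACHE_LINE   64    /* assumed line size in bytes */
#define NTA_DISTANCE 512   /* hypothetical prefetch distance in bytes */

/* Counts occurrences of `needle` in a buffer that is read only once.
 * Once per assumed cache line, a line further ahead is hinted as
 * non-temporal. Where that line ends up is implementation-specific. */
size_t count_byte_streaming(const unsigned char *buf, size_t len,
                            unsigned char needle)
{
    size_t hits = 0;
    for (size_t i = 0; i < len; i++) {
        if (i % CACHE_LINE == 0 && i + NTA_DISTANCE < len)
            _mm_prefetch((const char *)(buf + i + NTA_DISTANCE), _MM_HINT_NTA);
        if (buf[i] == needle)
            hits++;
    }
    return hits;
}
```

Whether this reduces cache pollution in practice depends on the microarchitecture, so it is worth measuring against the same loop without the hint.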

The last attribute of the (v)prefetch instructions is the "intent" attribute or the "eviction" attribute.
Under MESI and its variants, a request for ownership (RFO) must be made to bring a line into an exclusive state (in order to modify it).
An RFO is just a special read, so prefetching the line with an RFO brings it into the Exclusive state directly, provided we know we will write to it later (otherwise the first store to it will cancel the benefits of prefetching because of the "delayed" RFO needed).
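
A minimal sketch of a write-intent prefetch follows. It swaps in the GCC/Clang builtin __builtin_prefetch(addr, rw, locality) instead of _mm_prefetch, because the builtin is available regardless of which _MM_HINT_E* names a given header defines; this is a substitution made for the example, not what the intrinsic itself does. Depending on the target flags, the compiler may lower rw=1 to a write-intent prefetch instruction or fall back to an ordinary one. The function name, distance and line size are assumptions.

```c
#include <stddef.h>

#define INTS_PER_LINE 16   /* assumes 64-byte lines and 4-byte int */
#define WRITE_AHEAD   64   /* hypothetical distance, in elements */

/* Increments every element in place. Once per assumed cache line, it
 * hints that a line further ahead is about to be written, so the line
 * can be requested with ownership up front where the target allows. */
void increment_all(int *a, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (i % INTS_PER_LINE == 0 && i + WRITE_AHEAD < n)
            __builtin_prefetch(&a[i + WRITE_AHEAD],
                               1,   /* rw = 1: we intend to write */
                               3);  /* locality 3: keep in all levels */
        a[i] += 1;
    }
}
```

The benefit is likely to be largest for lines that other cores actually share, since an unshared line normally arrives in a state that can be upgraded cheaply on the first store.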

The IA32 and AMD architectures don't support an exclusive non-temporal hint (yet), since the way the non-temporal cache level works is implementation-defined.
The iMC architecture allows for it with the locality code _MM_HINT_ENTA.

¹ Which I understand to be the WC buffers. Peter Cordes clarified this in a comment below: prefetchnta only uses the Line-Fill buffers if prefetching USWC memory regions. Otherwise it prefetches into L1.

For reference here is the description of the instructions involved

PREFETCHh

Fetches the line of data from memory that contains the byte specified with the source operand to a location in the cache hierarchy specified by a locality hint:

• T0 (temporal data)—prefetch data into all levels of the cache hierarchy.
• T1 (temporal data with respect to first level cache misses)—prefetch data into level 2 cache and higher.
• T2 (temporal data with respect to second level cache misses)—prefetch data into level 3 cache and higher, or an implementation-specific choice.
• NTA (non-temporal data with respect to all cache levels)—prefetch data into non-temporal cache structure and into a location close to the processor, minimizing cache pollution.

PREFETCHWT1

Fetches the line of data from memory that contains the byte specified with the source operand to a location in the cache hierarchy specified by an intent to write hint (so that data is brought into ‘Exclusive’ state via a request for ownership) and a locality hint:

• T1 (temporal data with respect to first level cache)—prefetch data into the second level cache.

VPREFETCHh
                 Cache   Non-temporal   Exclusive state
                 Level
VPREFETCH0       L1      NO             NO
VPREFETCHNTA     L1      YES            NO
VPREFETCH1       L2      NO             NO
VPREFETCH2       L2      YES            NO
VPREFETCHE0      L1      NO             YES
VPREFETCHENTA    L1      YES            YES
VPREFETCHE1      L2      NO             YES
VPREFETCHE2      L2      YES            YES

`prefetchnta` only uses the Line-Fill buffers if prefetching USWC memory regions. Otherwise it prefetches into L1 (and L3 on CPUs with an inclusive L3), bypassing L2. (This is what Intel's optimization manual says). You can't do weakly-ordered loads from WB memory; there's no way to bypass cache coherency on WB. — Peter Cordes, Oct 02 '17 at 15:12
Oops, my previous comment isn't totally accurate. NT *stores* do bypass cache-coherency on WB memory. (Being weakly ordered is sort of the same thing as bypassing coherency.) Weakly-ordered loads from WB memory are impossible, but prefetchNTA can supposedly reduce cache pollution. Oh yeah, Intel's manual also says that if prefetchNTA puts data into L3, it goes into only one way in any given set, so it still reduces pollution there. (I have a half-finished answer with more details on this that I should finish and post...) — Peter Cordes, Oct 02 '17 at 17:20
@PeterCordes, very interesting. I'm looking forward to the answer of yours! — Margaret Bloom, Oct 02 '17 at 18:32
_"otherwise the first store to it will cancel the benefits of prefetching due to the "delayed" RFO needed"_ Actually, it's often not as bad as that. Unless the line is actually shared it will come into the core in E state, so the first write will have to do an E -> M transition, but this is cheap and generally "local" (i.e., the core only needs to flip a bit in one of its private caches, either L1 or L2, so it's nothing like a miss to memory or the shared cache). In this sense, whether the initial request is "correctly" flagged as an RFO is mostly important for lines that are actually shared. — BeeOnRope, May 17 '18 at 05:23
Correction again to my prev comment: NT stores bypass *cache*, but not *coherency*: they still invalidate any copies of a line that other cores might have, perhaps even before they're allowed to commit from the store buffer to an LFB. (Maybe write-back for dirty data; they have to avoid ever "rolling back" stores from other cores before the NT store is visible to all cores.) Their coherency traffic is only invalidate, not RFO, though. (related: [Enhanced REP MOVSB for memcpy](https://stackoverflow.com/q/43343231) re: no-RFO store protocols.) Again, that's unrelated to NT loads. — Peter Cordes, Oct 25 '21 at 22:27
