SSE4.1 NT loads (MOVNTDQA) only do anything special on WC memory regions on current CPUs. On WB memory, they're just like normal loads but cost an extra ALU uop.
You have to use NT prefetch if you want to minimize cache pollution from normal (WB) memory, and NT prefetches don't trigger the HW prefetchers. I think this is partly because HW prefetchers have no way to remember which streams are NT and which are normal. Also, on Intel CPUs the main prefetcher (the "streamer") sits at L2, but `prefetchnta` bypasses L2, so the streamer never sees those prefetches.
SW NT prefetch is "brittle": the right prefetch distance is hard to tune, hard to use, and specific to one machine. And there's a hard fall-off if you prefetch too far ahead and data starts getting evicted before it's used: since `prefetchnta` bypassed L2, a line dropped from L1d before it's needed has no L2 copy to fall back on.
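Here's a minimal, untested sketch of what that tuning knob looks like for a read-once pass over a big array. `PF_DIST` is a made-up starting point, and this naive version redundantly issues one prefetch per element (16 per 64-byte line); see the throttled variant further down.

```c
#include <immintrin.h>   /* _mm_prefetch, _MM_HINT_NTA */
#include <stddef.h>

/* Hypothetical tuning knob: how many bytes ahead to prefetch.
 * Too small: the prefetch hasn't completed when the loop catches up.
 * Too large: the line is evicted from L1d before use, and since
 * prefetchnta bypassed L2 there's no L2 copy to fall back on. */
#define PF_DIST 512

long sum_read_once(const int *arr, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* NT hint: minimize pollution for data we won't reuse.
         * Prefetching past the end of the array is safe;
         * prefetch hints don't fault. */
        _mm_prefetch((const char *)(arr + i) + PF_DIST, _MM_HINT_NTA);
        sum += arr[i];
    }
    return sum;
}
```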
See also How much of ‘What Every Programmer Should Know About Memory’ is still valid? - SW prefetch is generally a lot less useful now, because HW prefetchers are much better than they were on P4. But NT prefetch to minimize pollution is still something you can only do with software.
According to Patrick Fay (Intel)'s Nov 2011 post, "On recent Intel processors, prefetchnta brings a line from memory into the L1 data cache (and not into the other cache levels)." He also says you need to make sure you don't prefetch too late (HW prefetch will already have pulled it into all levels) or too early (evicted by the time you get there).
As discussed in comments on the OP, current Intel CPUs have a large shared L3 which is inclusive of all the per-core caches. This means cache-coherency traffic only has to check L3 tags to see if a cache line might be modified somewhere in a per-core L1/L2. (Xeon (server) cores of Skylake and later no longer use inclusive L3, instead having a separate coherence directory or filter.)
IDK how to reconcile Pat Fay's explanation with my understanding of cache coherency / the cache hierarchy. I'd have thought that if a line goes into L1, it would also have to go into (inclusive) L3. Maybe L1 tags have some kind of flag to say this line is weakly-ordered? My best guess is he was simplifying, saying L1 when the data actually only goes into fill buffers, or that his description only applied to older CPUs (before Nehalem) that didn't have an inclusive L3. I think the line has to get pulled into the cache proper for cache-coherency reasons, and there aren't enough fill buffers to support a useful prefetch distance (i.e. reading far enough ahead).
BeeOnRope's answer points out that Intel's optimization manual says NT prefetch from WB memory fills L1d cache, and (on CPUs with inclusive L3 cache) one "way" of the set-associative L3 cache. So NT prefetch of a huge array will only pollute 1/16th of L3 or so.
This Intel guide about working with video RAM talks about non-temporal moves using load/store buffers rather than cache lines. (Note that this may only be the case for uncacheable memory.) It doesn't mention prefetch, and it's also old, predating SandyBridge. However, it does have this juicy quote:
> Ordinary load instructions pull data from USWC (aka WC) memory in units of the same size the instruction requests. By contrast, a streaming load instruction such as MOVNTDQA will commonly pull a full cache line of data to a special "fill buffer" in the CPU. Subsequent streaming loads would read from that fill buffer, incurring much less delay.
And then another paragraph says typical CPUs have 8 to 10 fill buffers. SnB / Haswell still have 10 per core. Again, note that this may only apply to uncacheable memory regions.
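To make the fill-buffer behaviour concrete, here's a hedged sketch of the usage pattern the quote implies, assuming `src` points into a WC (USWC) mapping such as video RAM (the mapping setup isn't shown). The point is to keep the four 16-byte streaming loads of each cache line together, so the last three read from the fill buffer the first one allocated:

```c
#include <immintrin.h>   /* _mm_stream_load_si128 (SSE4.1) */

/* Copy one 64-byte line out of WC memory with MOVNTDQA.
 * The first streaming load pulls the whole line into a fill buffer;
 * the next three hit that buffer with much less delay.
 * src and dst are assumed 16-byte aligned. */
static void copy_line_from_wc(const __m128i *src, __m128i *dst) {
    __m128i a = _mm_stream_load_si128((__m128i *)&src[0]);
    __m128i b = _mm_stream_load_si128((__m128i *)&src[1]);
    __m128i c = _mm_stream_load_si128((__m128i *)&src[2]);
    __m128i d = _mm_stream_load_si128((__m128i *)&src[3]);
    _mm_store_si128(&dst[0], a);
    _mm_store_si128(&dst[1], b);
    _mm_store_si128(&dst[2], c);
    _mm_store_si128(&dst[3], d);
}
```

On WB memory the same code runs, but as noted at the top of this answer, the streaming loads there behave like ordinary loads.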
`movntdqa` on WB (write-back) memory is not weakly-ordered (see the NT loads section of the linked answer), so it's not allowed to be "stale". Unlike NT stores, neither `movntdqa` nor `prefetchnta` changes the memory-ordering semantics of Write-Back memory.
I have not tested this guess, but `prefetchnta` / `movntdqa` on a modern Intel CPU could load a cache line into L3 and L1, but skip L2 (because L2 isn't inclusive or exclusive of L1). The NT hint could take effect by placing the cache line in the LRU position of its set, where it's the next line to be evicted. (Normal cache policy inserts new lines at the MRU position, farthest from being evicted. See this article about IvB's adaptive L3 policy for more about cache-insertion policy.)
(Actually it prefetches into only 1 way of the set it maps to, so the next NT prefetch to that set will evict the previous NT prefetch, not something else.)
Prefetch throughput on IvyBridge is only one per 43 cycles, so be careful not to prefetch too much if you don't want prefetches to slow down your code on IvB. Source: Agner Fog's insn tables and microarch guide. This is a performance bug specific to IvB; on other designs, excessive prefetching just eats uop throughput that could have gone to useful instructions (apart from the harm of prefetching useless addresses).
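If prefetch instruction throughput is a concern, the usual mitigation is one prefetch per cache line rather than per element. A variant of the earlier sketch, again untested and with a 64-byte line size assumed rather than queried:

```c
#include <immintrin.h>
#include <stddef.h>

#define PF_DIST    512   /* same hypothetical distance knob as before */
#define LINE_BYTES 64    /* assumed cache-line size */

long sum_read_once_throttled(const int *arr, size_t n) {
    long sum = 0;
    const size_t step = LINE_BYTES / sizeof(int);   /* 16 ints per line */
    for (size_t i = 0; i < n; i += step) {
        /* one prefetch per cache line instead of 16 */
        _mm_prefetch((const char *)(arr + i) + PF_DIST, _MM_HINT_NTA);
        size_t end = (i + step < n) ? i + step : n;
        for (size_t j = i; j < end; j++)
            sum += arr[j];
    }
    return sum;
}
```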
About SW prefetching in general (not the `nt` kind): Linus Torvalds posted about how prefetches rarely help in the Linux kernel, and often do more harm than good. Apparently prefetching a NULL pointer at the end of a linked list can cause a slowdown, because it attempts a TLB fill.
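As an illustration of that pitfall, here's a hedged sketch with a made-up node type. The unguarded version would prefetch `NULL` on the last iteration, which can trigger a useless page walk; checking first (or just dropping the prefetch, as the kernel mostly did) avoids it:

```c
#include <immintrin.h>
#include <stddef.h>

struct node {             /* hypothetical list node */
    struct node *next;
    long payload;
};

long sum_list(const struct node *p) {
    long sum = 0;
    while (p) {
        /* Guard against prefetching NULL at the end of the list,
         * which would attempt a pointless TLB fill / page walk. */
        if (p->next)
            _mm_prefetch((const char *)p->next, _MM_HINT_T0);
        sum += p->payload;
        p = p->next;
    }
    return sum;
}
```

Even with the guard, prefetching only one node ahead rarely hides much of the latency of pointer chasing, which is part of why these prefetches rarely help.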