11

When executing a series of `_mm_stream_load_si128()` calls (MOVNTDQA) from consecutive memory locations, will the hardware prefetcher still kick in, or should I use explicit software prefetching (with the NTA hint) in order to obtain the benefits of prefetching while still avoiding cache pollution?

The reason I ask is that their objectives seem contradictory to me. A streaming load will fetch data bypassing the cache, while the prefetcher attempts to proactively fetch data into the cache.

When sequentially iterating a large data structure (processed data won't be retouched for a long while), it would make sense to me to avoid polluting the cache hierarchy, but I do not want to incur frequent ~100-cycle penalties because the prefetcher is idle.

The target architecture is Intel Sandy Bridge.
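
To make the comparison concrete, here is a rough sketch of the two variants I'm weighing; the function names and the prefetch distance `PF_DIST` are just placeholders I'd have to tune, not anything prescribed:

```c
#include <smmintrin.h>  /* SSE4.1: _mm_stream_load_si128 (MOVNTDQA); also pulls in _mm_prefetch */
#include <stddef.h>

/* Variant 1: streaming loads only, relying on the HW prefetcher (if it kicks in). */
static __m128i consume_stream(const __m128i *src, size_t n)
{
    __m128i sum = _mm_setzero_si128();
    for (size_t i = 0; i < n; i++)
        sum = _mm_add_epi32(sum, _mm_stream_load_si128((__m128i *)&src[i]));
    return sum;
}

/* Variant 2: same loop plus an explicit NTA prefetch some distance ahead.
   PF_DIST (in 16-byte elements) is a placeholder that would need per-machine tuning. */
#define PF_DIST 32
static __m128i consume_stream_swpf(const __m128i *src, size_t n)
{
    __m128i sum = _mm_setzero_si128();
    for (size_t i = 0; i < n; i++) {
        _mm_prefetch((const char *)&src[i + PF_DIST], _MM_HINT_NTA);
        sum = _mm_add_epi32(sum, _mm_stream_load_si128((__m128i *)&src[i]));
    }
    return sum;
}
```

Variant 2 issues a prefetch per 16-byte element, i.e. four per cache line; that redundancy is functionally harmless but is one of the knobs I'd expect to have to tune.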

BlueStrat
    Good question. There's a `prefetchnta`, but I forget what I've read about this case. – Peter Cordes Aug 19 '15 at 19:38
  • 1
    According to some older Intel docs, non-temporal loads are the same as normal aligned loads unless the memory is uncachable. My personal experience has confirmed that they make no performance difference on normal data. But this was back in the Nehalem/Sandy Bridge era. I have no idea if anything has changed for Haswell or Skylake. – Mysticial Aug 19 '15 at 21:01
  • 1
    @PeterCordes `prefetchnta` pulls into L1 cache only rather than all the caches. That said, I have no idea how it interacts with the hardware prefetcher. In cases where the memory access is "random enough" for the hardware prefetcher to fail, but "sequential enough" to use full cachelines (as is the case in a lot of cache-blocking optimizations), I've found that software prefetching makes a huge difference in the absence of Hyperthreading. (~10%) But I've seen no observable difference between `prefetcht0` and `prefetchnta`. – Mysticial Aug 19 '15 at 21:08
  • 1
    @Mysticial: L3 is inclusive on recent Intel designs, so L3 tags can be used for cache coherency checks. A cache line present in L1 but not L3 could get stale if another core modified that cache line, but I think IA32's cache coherency model disallows this (so it can't be implemented this way). `prefetchnta` was introduced in PIII days, before multi-core CPUs. I wouldn't be at all surprised if it did exactly the same thing as `prefetcht0` on current designs, like how `lddqu` is now identical to `movdqu`. Perhaps `prefetchnta` makes cache lines more likely to be evicted again quickly. – Peter Cordes Aug 19 '15 at 21:24
  • @PeterCordes Thanks for that insight on the caches. I've never thought about this from the perspective of cache coherency. – Mysticial Aug 19 '15 at 21:38
  • @Mysticial: I went digging and found enough stuff to post an answer. It's not fully relevant to the OP, and raises more questions than it answers (esp. about how prefetchnta can pull something into L1 but not L3, given the coherency mechanism.) – Peter Cordes Aug 20 '15 at 12:37

4 Answers

15

SSE4.1 NT loads (MOVNTDQA) only do anything special on WC memory regions on current CPUs. On WB memory, they're just like normal loads but cost an extra ALU uop.

You have to use NT prefetch if you want to minimize cache pollution from normal memory. And those don't trigger HW prefetchers. I think this is partly because HW prefetchers don't have the ability to remember which streams are NT and which are normal. And on Intel CPUs, the main prefetcher (the "streamer") is in L2. But prefetchnta bypasses L2, so it never sees those prefetches.

SW NT prefetch is "brittle": tuning the right prefetch distance is machine-specific and hard to get right, and there's a hard fall-off if you prefetch too far ahead and data starts getting evicted, because the line isn't sitting in L2 as a backstop if it's dropped from L1d before it's needed.

See also How much of ‘What Every Programmer Should Know About Memory’ is still valid? - SW prefetch is generally a lot less useful because HW prefetchers are better than on P4. But NT prefetch to minimize pollution is still something you can only do with software.


According to Patrick Fay (Intel)'s Nov 2011 post, "On recent Intel processors, prefetchnta brings a line from memory into the L1 data cache (and not into the other cache levels)." He also says you need to make sure you don't prefetch too late (HW prefetch will already have pulled it into all levels), or too early (evicted by the time you get there).


As discussed in comments on the OP, current Intel CPUs have a large shared L3 which is inclusive of all the per-core caches. This means cache-coherency traffic only has to check L3 tags to see if a cache line might be modified somewhere in a per-core L1/L2. (Xeon (server) cores of Skylake and later no longer use inclusive L3, instead having a separate coherence directory or filter.)

IDK how to reconcile Pat Fay's explanation with my understanding of cache coherency / the cache hierarchy. I thought that if a line goes into L1, it would also have to go into L3. Maybe L1 tags have some kind of flag to say this line is weakly ordered? My best guess is that he was simplifying, saying L1 when it actually only goes into fill buffers, or that his description only applied to older CPUs (before Nehalem) that didn't have an inclusive L3. I think the line has to get pulled into cache proper for cache-coherency reasons, and there aren't enough fill buffers to support a useful prefetch distance (reading far enough ahead).

BeeOnRope's answer points out that Intel's optimization manual says NT prefetch from WB memory fills L1d cache, and (on CPUs with inclusive L3 cache) one "way" of the set-associative L3 cache. So NT prefetch of a huge array will only pollute 1/16th of L3 or so.


This Intel guide about working with video RAM talks about non-temporal moves using load/store buffers, rather than cache lines. (Note that this may only be the case for uncacheable memory.) It doesn't mention prefetch. It's also old, predating Sandy Bridge. However, it does have this juicy quote:

Ordinary load instructions pull data from USWC (aka WC) memory in units of the same size the instruction requests. By contrast, a streaming load instruction such as MOVNTDQA will commonly pull a full cache line of data to a special "fill buffer" in the CPU. Subsequent streaming loads would read from that fill buffer, incurring much less delay.

Another paragraph says that typical CPUs have 8 to 10 fill buffers; SnB/Haswell still have 10 per core. Again, note that this may only apply to uncacheable memory regions.
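
To illustrate the technique that guide describes, here's a minimal sketch of a streaming-load copy from a WC (USWC) region, assuming 16-byte-aligned pointers and a length that's a multiple of 64. It's a simplification of the guide's approach (the real recommendation stages through a small cacheable bounce buffer before writing out); the point is that the four back-to-back MOVNTDQA loads of each 64-byte line are all serviced from the same fill buffer:

```c
#include <smmintrin.h>  /* SSE4.1: _mm_stream_load_si128; earlier headers give _mm_stream_si128, _mm_sfence */
#include <stddef.h>

/* Copy from a WC (USWC) mapping, one 64-byte line per iteration.
   Only the first streaming load of a line pays the full access latency;
   the other three read from the fill buffer it populated. */
void copy_from_wc(void *dst, const void *src, size_t len)
{
    __m128i       *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    for (size_t i = 0; i < len / 16; i += 4) {
        __m128i v0 = _mm_stream_load_si128((__m128i *)&s[i + 0]);
        __m128i v1 = _mm_stream_load_si128((__m128i *)&s[i + 1]);
        __m128i v2 = _mm_stream_load_si128((__m128i *)&s[i + 2]);
        __m128i v3 = _mm_stream_load_si128((__m128i *)&s[i + 3]);
        _mm_stream_si128(&d[i + 0], v0);   /* NT stores so the destination */
        _mm_stream_si128(&d[i + 1], v1);   /* doesn't pollute the cache either */
        _mm_stream_si128(&d[i + 2], v2);
        _mm_stream_si128(&d[i + 3], v3);
    }
    _mm_sfence();  /* make the NT stores globally visible before the copy is considered done */
}
```

With only 8 to 10 fill buffers, you don't want too many partially-read WC lines in flight at once, which is part of why the guide processes one line at a time like this.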

movntdqa on WB (write-back) memory is not weakly-ordered (see the NT loads section of the linked answer), so it's not allowed to be "stale". Unlike NT stores, neither movntdqa nor prefetchnta change the memory ordering semantics of Write-Back memory.

I have not tested this guess, but prefetchnta / movntdqa on a modern Intel CPU could load a cache line into L3 and L1, but could skip L2 (because L2 isn't inclusive or exclusive of L1). The NT hint could have an effect by placing the cache line in the LRU position of its set, where it's the next line to be evicted. (Normal cache policy inserts new lines at the MRU position, farthest from being evicted. See this article about IvB's adaptive L3 policy for more about cache insertion policy).

(Actually it prefetches into 1 way of the set it's in, so the next NT prefetch will definitely evict the previous NT prefetch, not something else.)


Prefetch throughput on Ivy Bridge is only one per 43 cycles, so be careful not to prefetch too much if you don't want prefetches to slow down your code on IvB. Source: Agner Fog's insn tables and microarch guide. This is a performance bug specific to IvB. On other designs, too much prefetch will just take up uop throughput that could have gone to useful instructions (besides any harm from prefetching useless addresses).

About SW prefetching in general (not the nt kind): Linus Torvalds posted about how they rarely help in the Linux kernel, and often do more harm than good. Apparently prefetching a NULL pointer at the end of a linked-list can cause a slowdown, because it attempts a TLB fill.
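
The linked-list pattern in question looks roughly like this (illustrative code, not the kernel's actual implementation); on the final iteration the prefetch argument is NULL, which is the access that used to cost a TLB fill / page walk:

```c
struct node { struct node *next; long payload; };

/* Walk a list while prefetching the next node.  On the last iteration
   n->next is NULL, and that speculative "prefetch of NULL" could itself
   trigger a page walk on older CPUs, making the loop slower overall. */
long sum_list(const struct node *head)
{
    long sum = 0;
    for (const struct node *n = head; n != NULL; n = n->next) {
        __builtin_prefetch(n->next);   /* GCC/Clang builtin; may be NULL here */
        sum += n->payload;
    }
    return sum;
}
```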

Peter Cordes
  • 2
    +1 Nice research! Yeah I completely disable prefetching on anything that targets Ivy Bridge. And I can confirm that prefetching nulls is a terrible idea. This was something I tried as a way to avoid having a "no prefetch" version of a specific function. Totally didn't work. VTune yelled at me for it. – Mysticial Aug 20 '15 at 14:04
  • 2
    @Leeor: IvB can only retire one `prefetch*` instruction per 43 cycles. SnB and Haswell can retire one per 0.5 cycles. (They run on the load ports.) So overdoing it with prefetch can cause the prefetch instructions themselves to be a bottleneck on IvB, esp. when the data already is in cache. – Peter Cordes Aug 20 '15 at 23:06
  • Thanks! Interesting, I wasn't expecting such a difference on an Intel "tick" project. I can't seem to reproduce it though... – Leeor Aug 24 '15 at 07:54
  • @Leeor: maybe they fixed it in a later stepping? I don't have an IvB; did you test a loop of just prefetches? – Peter Cordes Aug 25 '15 at 12:16
  • 1
    I tested a loop of independent prefetches (L1 resident, to avoid memory limitations), got a throughput of 0.5. I think I'll open a question about this later, maybe I'm doing something wrong. – Leeor Aug 25 '15 at 12:19
  • @PeterCordes, I greatly appreciate the research. As you correctly mention, this actually leaves me with more questions though ;) I guess that my best bet is to code the three approaches and profile on the target platform (1. non streaming loads, 2. streaming loads, 3. streaming loads with explicit nta prefetch). I'll post my findings. Thanks again – BlueStrat Aug 25 '15 at 16:52
  • I recently experimented with this a bit. The three cases I tried were: 1) Streaming loads (alone). 2) Streaming loads + prefetcht0. 3) Streaming loads + prefetchnta. | Streaming loads (alone) was the slowest. The other two were each 3% faster, with no distinguishable difference between the two types of prefetch. – Mysticial Oct 09 '15 at 19:06
  • 2
    When I look at it under VTune, case 1 (streaming loads alone) shows all the time being spent in those loads. No surprise here, they're coming from memory. In cases 2 and 3 (with the prefetch), VTune shows that all the time is spent in the prefetches themselves and zero time spent in the streaming loads. This came as a surprise since it suggests that there's a limited number of in-flight prefetches, and they will block execution when the limit is reached. If they didn't block, the penalty should still show up in the loads if the memory controller can't keep up with the prefetch requests. – Mysticial Oct 09 '15 at 19:24
  • @PeterCordes So my recent experiments on Skylake X have led me down this path again. `prefetchnta` and `prefetcht0` are no longer the same. For a prefetch distance of ~700 instructions on a memory-bound loop (about 1500 cycles in this case), `prefetchnta` is actively harmful while `prefetcht0` is very beneficial. (no prefetch: 26.6s, `prefetchnta`: 44.6s, `prefetcht0`: 22.1s, `prefetcht1`: 23.7) – Mysticial Dec 01 '17 at 04:09
  • I don't know what's going on. It's almost as if the `prefetchnta` pulls the data into cache, then loses it immediately. Then the loads pull them in again - thus consuming twice the bandwidth. The spatial memory footprint of those 700 instructions is only about 4k - which fits comfortably into L1. Perhaps the 1500 cycles is long enough for the other hyperthread to completely wipe the L1. – Mysticial Dec 01 '17 at 04:13
  • 2
    @Mysticial: Intel's manuals imply that `prefetchNTA` fetches into L1D and (into one way of) L3, bypassing L2. On SKX, perhaps it also bypasses L3 because it's not inclusive anymore (and only updates some kind of tags). Maybe SKX also has pollution-limitation in L1D by only fetching into one way of any given set? `32/8 = 4`, so 4kiB is just barely big enough to step on data before you get to it if NT prefetch is only using a single way of L1D. (IDK if that's a *likely* design change, but try smaller prefetch distance). Otherwise maybe it's a design bug of some sort... – Peter Cordes Dec 01 '17 at 04:29
  • @Mysticial: Do we know for sure whether `prefetchNTA` actually bypasses L3 on SKX? I've been assuming it does based on your results and @Bee's answer on [Do current x86 architectures support non-temporal loads (from "normal" memory)?](https://stackoverflow.com/a/49844984), but that's not definitive. Care to weigh in on [Difference between PREFETCH and PREFETCHNTA instructions](https://stackoverflow.com/posts/comments/93463577)? – Peter Cordes Nov 13 '18 at 23:41
  • @PeterCordes Can't say for sure. But it certainly looks like it for the specific case I was using. – Mysticial Nov 14 '18 at 00:06
  • @Mysticial: Hadi Brais suggested an experiment that could answer the question: prefetching on one core and then loading on another core, because L3 hit time is different from core-to-core or DRAM latency. [Difference between PREFETCH and PREFETCHNTA instructions](https://stackoverflow.com/posts/comments/93463577). Please reply there if you (or any other future reader) has the time and curiosity to try that out. – Peter Cordes Nov 14 '18 at 00:09
  • @PeterCordes Worth noting that the TLB miss on prefetching NULL is pre-Skylake: [SDM](https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf#page=35): "Reduced performance penalty for a software prefetch that specifies a NULL pointer." I tested (on Icelake, but imagine it changed on Skylake) and didn't see `DTLB_LOAD_MISSES` for addresses [0:4095]. – Noah Feb 26 '21 at 21:45
9

I recently ran some tests of the various prefetch flavors while answering another question, and my findings were:

The results from using prefetchnta were consistent with the following implementation on Skylake client:

  • prefetchnta loads values into the L1 and L3 but not the L2 (in fact, it seems the line may be evicted from the L2 if it is already there).
  • It seems to load the value "normally" into L1, but in a weaker way in L3 such that it is evicted more quickly (e.g., only into a single way in the set, or with its LRU flag set such that it will be the next victim).
  • prefetchnta, like all other prefetch instructions, uses an LFB entry, so prefetching doesn't really help you get additional parallelism; but the NTA hint can be useful here to avoid L2 and L3 pollution.

The current optimization manual (248966-038) claims in a few places that prefetchnta does bring data into the L2, but only in one way out of the set. E.g., in 7.6.2.1 Video Encoder:

The prefetching cache management implemented for the video encoder reduces the memory traffic. The second-level cache pollution reduction is ensured by preventing single-use video frame data from entering the second-level cache. Using a non-temporal PREFETCH (PREFETCHNTA) instruction brings data into only one way of the second-level cache, thus reducing pollution of the second-level cache.

This isn't consistent with my test results on Skylake, where striding over a 64 KiB region with prefetchnta shows performance almost exactly consistent with fetching data from the L3 (~4 cycles per load, with an MLP factor of 10 and an L3 latency of about 40 cycles):

                                 Cycles       ns
         64-KiB parallel loads     1.00     0.39
    64-KiB parallel prefetcht0     2.00     0.77
    64-KiB parallel prefetcht1     1.21     0.47
    64-KiB parallel prefetcht2     1.30     0.50
   64-KiB parallel prefetchnta     3.96     1.53

Since the L2 in Skylake is 4-way, if the data was loaded into one way it should just barely stay in the L2 cache (one way of which covers 64 KiB), but the results above indicate that it doesn't.

You can run these tests on your own hardware on Linux using my uarch-bench program. Results for old systems would be particularly interesting.
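
If you don't want to pull in uarch-bench, here is a rough standalone sketch of the same idea (it is not the harness that produced the numbers above): prefetchnta a 64 KiB region, then time demand loads over it. If the prefetched lines only survive in L3 (or nowhere), the per-line cost will look like L3 (or DRAM) latency rather than an L1/L2 hit. Timing is via `rdtsc`, so it counts reference cycles; the buffer size and iteration count are arbitrary choices:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>   /* __rdtsc, _mm_prefetch, _mm_lfence (GCC/Clang) */

enum { REGION = 64 * 1024, LINE = 64, ITERS = 1000 };

int main(void)
{
    char *buf = aligned_alloc(4096, REGION);
    for (size_t i = 0; i < REGION; i++)
        buf[i] = (char)i;                       /* fault the pages in */

    uint64_t best = UINT64_MAX;
    volatile uint64_t sink = 0;
    for (int it = 0; it < ITERS; it++) {
        /* Pass 1: prefetchnta every line in the region. */
        for (size_t off = 0; off < REGION; off += LINE)
            _mm_prefetch(buf + off, _MM_HINT_NTA);
        _mm_lfence();

        /* Pass 2: time independent demand loads of the same lines. */
        uint64_t t0 = __rdtsc();
        uint64_t sum = 0;
        for (size_t off = 0; off < REGION; off += LINE)
            sum += *(volatile uint64_t *)(buf + off);
        _mm_lfence();
        uint64_t t1 = __rdtsc();

        sink += sum;
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    printf("~%.2f ref cycles per line (best of %d runs, sink=%llu)\n",
           (double)best / (REGION / LINE), ITERS,
           (unsigned long long)sink);
    free(buf);
    return 0;
}
```

Swapping `_MM_HINT_NTA` for `_MM_HINT_T0` gives a point of comparison.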

Skylake Server (SKLX)

The reported behavior of prefetchnta on Skylake Server, which has a different L3 cache architecture, is significantly different from Skylake client. In particular, user Mysticial reports that lines fetched using prefetchnta are not available in any cache level and must be re-read from DRAM once they are evicted from L1.

The most likely explanation is that they never entered L3 at all as a result of the prefetchnta. This is plausible since in Skylake Server the L3 is a non-inclusive shared victim cache for the private L2 caches, so lines that bypass the L2 cache using prefetchnta probably never get a chance to enter the L3. This makes prefetchnta purer in function (fewer cache levels are polluted by prefetchnta requests) but also more brittle: any failure to read an nta line from L1 before it is evicted means another full round trip to memory, and the initial request triggered by the prefetchnta is totally wasted.

BeeOnRope
  • According to Intel's manuals, `prefetchnta` only uses one way per set in L3, limiting pollution to 1/n of the n-way set-associative cache. (This applies to CPUs new enough to have an inclusive L3. I'm curious what SKX will do, where L3 is no longer inclusive.) – Peter Cordes Dec 24 '17 at 17:49
  • @PeterCordes - yeah, maybe it won't load it at all into the L3. Do we know if the L3 still has tags for all lines in the L1/L2 so it can act as a snoop filter? Where do you see that info in Intel's manual? I took a scan of the current optimization manual (248966-038) and every place where it says something explicit, it says "brings data into only one way of the **second-level cache**". I never saw any mention of L3 behavior. A lot of the text is still mentioning concerns relating to P4 and other ancient architectures though. – BeeOnRope Dec 24 '17 at 18:22
  • optimization manual, june 2016 version. Section 7.3.2: "*Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer microarchitectures: must fetch into 3rd level cache with fast replacement*", page 280. For "Core" processors based on those uarches (i.e. "core i7"), it's "may" instead of "must", and describes the bypassing L2. – Peter Cordes Dec 24 '17 at 19:23
  • I think SKX must still have inclusive tags to track what's cached in inner caches. IDK if that's separate, or implemented as extra ways in L3, or what kind of designs are possible. Actually sending snoop requests all over the place isn't plausible. All I've read is guesswork based on patents and KNL: https://www.anandtech.com/show/11550/the-intel-skylakex-review-core-i9-7900x-i7-7820x-and-i7-7800x-tested/5. But that article isn't very good on microarchitectural details for stuff other than cache; many errors like saying the IDQ is 128 uops with HT disabled. – Peter Cordes Dec 24 '17 at 19:27
  • I guess my copy of the PDF is out of date: I was looking at 248966-033, not -38. The places that say L2 should probably say "last level". (Except on Broadwell where eDRAM can technically be the last-level cache, but I think LLC on Broadwell would still normally refer to the L3 even on CPUs with eDRAM. And BTW, SKL with eDRAM uses it as a memory-side cache, not a last-level cache.) – Peter Cordes Dec 24 '17 at 19:30
  • Oh great, somehow I missed that page while scanning the prefetchnta references. The whole page is pretty interesting. I see it says that `nt1` and `nt2` are identical, my testing showed a difference for L1 contained sizes, but perhaps it was just a testing artifact. About the manual versions, it's weird: the primary Intel download link seems to be the old manual, but this new one is linked from several other places (e.g., [sandpile](http://www.sandpile.org/x86/ref_docs.htm)). @PeterCordes – BeeOnRope Dec 24 '17 at 19:47
  • @PeterCordes - some repeated runs showed that `nt1` and `nt2` numbers bounce around a bunch for the 16 KiB test, and so any apparent difference is likely just noise. Over several runs I never really saw any consistent difference, so I edited the text to reflect that. – BeeOnRope Apr 15 '18 at 18:07
6

This question got me to do some reading... Looking at the Intel manual for MOVNTDQA (using a Sep'14 edition), there's an interesting statement -

A processor implementation may make use of the non-temporal hint associated with this instruction if the memory source is WC (write combining) memory type. An implementation may also make use of the non-temporal hint associated with this instruction if the memory source is WB (write back) memory type.

and later on -

The memory type of the region being read can override the non-temporal hint, if the memory address specified for the non-temporal read is not a WC memory region.

So there appears to be no guarantee that the non-temporal hint will do anything unless your mem type is WC. I don't really know what the WB memtype comment means, maybe some Intel processors do allow you to use it for the benefits of reducing cache pollution, or maybe they wanted to keep this option for the future (so you don't start using MOVNTDQA on WB mem and assume it would always behave the same), but it's quite clear that WC mem is the real use-case here. You want this instruction to provide some short-term buffering for stuff that would otherwise be completely uncacheable.

Now, on the other hand, looking at the description for prefetch*:

Prefetches from uncacheable or WC memory are ignored.

So that pretty much closes the story: your thinking is absolutely correct. These two probably aren't meant to work together, and chances are that one of them will be ignored.

OK, but is there a chance these two would actually work together (if the processor implements NT loads for WB memory)? Well, reading the MOVNTDQA description again, something else catches the eye:

Any memory-type aliased lines in the cache will be snooped and flushed.

Ouch. So if you somehow do manage to prefetch into your cache, you're actually likely to degrade the performance of any subsequent streaming load, since it would have to flush the line out first. Not a pretty thought.

Leeor
2

Note: I wrote this answer when I was less knowledgeable, but I think it's still OK and useful.

Both MOVNTDQA (on WC memory) and PREFETCHNTA do not affect or trigger any of the cache hardware prefetchers. The whole idea of the non-temporal hint is to completely avoid cache pollution or at least minimize it as much as possible.

There is only a very small (undocumented) number of buffers called streaming load buffers (these are separate from the line fill buffers and from the L1 cache) to hold cache lines fetched using MOVNTDQA. So basically you need to use what you fetch almost immediately. In addition, MOVNTDQA only works on WC memory on most Intel processors. On the Golden Cove (GLC) cores of Intel Alder Lake (ADL), when MOVNTDQA is used on a memory location of type WB, a non-temporal protocol is used by default. The WB ordering semantics are still preserved, though, because the NT hint can never override the effective memory type in any case. This is not a breaking change and is consistent with the documentation.

The PREFETCHNTA instruction is perfect for your scenario, but you have to figure out how to use it properly in your code. From the Intel optimization manual Section 7.1:

If your algorithm is single-pass use PREFETCHNTA. If your algorithm is multi-pass use PREFETCHT0.

The PREFETCHNTA instruction offers the following benefits:

  • It fetches the particular cache line that contains the specified address into at least the L3 cache and potentially higher levels of the cache hierarchy (see Bee's and Peter's answers and Section 7.3.2). In every cache level it gets cached in, it is likely to be considered the first line to evict when a line has to be evicted from the set. In an implementation of a single-pass algorithm (such as computing the average of a large array of numbers; a sketch follows this list) that is enhanced with PREFETCHNTA, later prefetched cache lines can be placed in the same block as those lines that were also prefetched using PREFETCHNTA. So even if the total amount of data being fetched is massive, only one way of the whole cache will be affected; the data that resides in the other ways remains cached and will be available after the algorithm terminates. But this is a double-edged sword: if two PREFETCHNTA instructions are too close to each other and the specified addresses map to the same cache set, then only one will survive.
  • Cache lines prefetched using PREFETCHNTA are kept coherent like any other cached lines using the same hardware coherence mechanism.
  • It works on the WB, WC, and WT memory types. Most probably your data is stored in WB memory.
  • Like I said before, it does not trigger hardware prefetching. For this reason, it can also be used to improve the performance of irregular memory access patterns, as recommended by Intel.

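Here is a hypothetical sketch of the single-pass example mentioned in the list above: averaging a large array with a PREFETCHNTA issued once per 64-byte cache line. The 8-line prefetch distance is just an assumed starting point, not a value from the manual, and would need tuning:

```c
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_NTA */
#include <stddef.h>

/* Single-pass average: each cache line is read exactly once, so the NTA
   hint limits how much of the cache hierarchy the traversal displaces. */
double average_nta(const double *a, size_t n)
{
    const size_t per_line = 64 / sizeof(double);   /* 8 doubles per cache line */
    const size_t pf_dist  = 8 * per_line;          /* assumed distance: 8 lines ahead */
    double sum = 0.0;

    for (size_t i = 0; i < n; i++) {
        if (i % per_line == 0)                      /* one prefetch per line, not per element */
            _mm_prefetch((const char *)&a[i + pf_dist], _MM_HINT_NTA);
        sum += a[i];
    }
    return n ? sum / (double)n : 0.0;
}
```

Prefetching a little past the end of the array is harmless, since prefetch instructions never fault.
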
The thread that executes PREFETCHNTA may not be able to effectively benefit from it, depending on the behavior of other threads running on the same physical core, on other physical cores of the same processor, or on cores of other processors that share the same coherence domain. Techniques such as pinning, priority boosting, CAT-based cache partitioning, and disabling hyperthreading may help that thread run efficiently. Note also that PREFETCHNTA is classified as a speculative load and so it is concurrent with the three fence instructions.

Hadi Brais
  • 1
    `movntdqa` on WB memory ignores the NT hint, on current Intel hardware. So it *does* trigger regular prefetch, and runs like `movdqa` + an ALU uop. (Otherwise it would have bad throughput from only doing demand misses, which is probably why it ignores the NT hint. I have a half-finished update to my answer on this question which says that in more detail.) Anyway, that's why SW NT prefetch is the only option for minimizing load pollution on WB memory, on current hardware, but it's brittle especially on SKX where L3 is non-inclusive; early eviction means reload all the way from DRAM. – Peter Cordes May 17 '18 at 07:58
  • 2
    How are you sure `prefetchnta` has special handling (filling only a single way and/or being marked "evict next") in _all_ levels of cache the line is populated in? When I tested it, I found that it seems to have special handling in L3 (i.e., it only used a portion of L3), but not in L1 (i.e., it seemed to behave normally there, being able to use all 32 KiB and not being evicted first). The lines didn't seem to be brought into L2 at all. – BeeOnRope May 17 '18 at 18:23
  • @BeeOnRope Yeah, it's not really a guarantee. Actually, supporting that has some small hardware overhead (you need an NT attribute bit with every fetched cache line + the relevant logic to handle it), so it might not be implemented. – Hadi Brais May 17 '18 at 18:42
  • Well, only fetching into one way of the L1 would be _very_ fragile too, since any access to the same set would clobber it, and given the small size and high associativity of the L1, and that applications usually don't control exactly the page offset of all their memory accesses, this would be very likely. Also, it would make `prefetchnta` all but useless for any code that is accessing more than one stream of memory (since any additional stream would almost certainly clobber the NTA accesses out of L1). – BeeOnRope May 17 '18 at 18:56
  • So I think even ignoring hardware costs, you wouldn't want to implement it exactly like that in L1, or it would be very hard to use effectively. It's more about avoiding the pollution of the other caches, which are much larger and hence imply a much higher total cost when you fully pollute them. – BeeOnRope May 17 '18 at 18:57