
On x86 processors, is there a way to load data from regular write-back memory into registers without going through the cache hierarchy?

My use case is that I have a big lookup structure (a hash map or a B-tree). I am working through a large stream of numbers (much bigger than my L3, but it fits in memory). What I am trying to do is very simple:

    int result = 0;
    for (int num : stream_numbers) {
        int lookup_result = lookup_using_b_tree(num);
        result += do_some_math_that_touches_registers_only(lookup_result);
    }
    return result;

Since I am visiting every number only once, and the stream as a whole is larger than the L3, I imagine it will end up evicting cache lines that hold parts of my B-tree. Instead, I'd ideally like the numbers from this stream not to enter the cache at all, since they have no temporal locality (each one is read exactly once). That way I maximize the chances that my B-tree stays in cache and lookups are faster.

I have looked at the (v)movntdqa instructions available in SSE 4.1 for streaming (non-temporal) loads. They don't seem to be a good fit, because they appear to be effective only for uncacheable write-combining memory. This old article from Intel claims that:

Future generations of Intel processors may contain optimizations and enhancements for streaming loads, such as increased utilization of the streaming load buffers and support for additional memory types, creating even more opportunities for software developers to increase the performance and energy-efficiency of their applications.

However, I am unaware of any such processor today. I have also read elsewhere that a processor can simply ignore this hint for write-back memory and perform a regular movdqa instead. So is there any way I can load from regular write-back memory without going through the cache hierarchy on x86 processors, even if it is only possible on Haswell and later models? I'd also appreciate any information on whether this will ever be possible in the future.
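For reference, here is a sketch of what the movntdqa route looks like with intrinsics (assumptions: `stream_numbers` is 16-byte aligned, its length is a multiple of 4, SSE 4.1 is enabled, and the two helper functions are the ones from the loop above):

    #include <smmintrin.h>   // SSE4.1: _mm_stream_load_si128 (MOVNTDQA)
    #include <cstddef>

    int lookup_using_b_tree(int num);                     // as above
    int do_some_math_that_touches_registers_only(int x);  // as above

    int sum_with_streaming_loads(int* stream_numbers, std::size_t count)
    {
        int result = 0;
        for (std::size_t i = 0; i < count; i += 4) {
            // MOVNTDQA is only a non-temporal *hint*; on ordinary WB memory
            // current CPUs apparently treat it like a plain MOVDQA load.
            __m128i chunk = _mm_stream_load_si128((__m128i*)(stream_numbers + i));
            alignas(16) int nums[4];
            _mm_store_si128((__m128i*)nums, chunk);
            for (int j = 0; j < 4; ++j)
                result += do_some_math_that_touches_registers_only(
                              lookup_using_b_tree(nums[j]));
        }
        return result;
    }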

Rajiv
  • A similar question was asked recently, you might be interested in [this one](http://stackoverflow.com/q/28684812/417501), too. – fuz Jun 18 '16 at 08:48
  • @FUZxxl: That one was clarified to be for benchmarking reasons, which is quite different. This one is more like http://stackoverflow.com/questions/37889896/intel-instructions-for-access-to-memory-which-skips-cache, which isn't really a duplicate of the benchmarking question. – Peter Cordes Jun 18 '16 at 09:49
  • AFAIK there's no reliable / guaranteed way to do this. `prefetchnta` might be helpful, but again, it's not clear if it can do anything useful, since it doesn't override the strong-ordering cache-coherency semantics of WB memory types. I think the best you can hope for is prefetchnta or movntdqa to load into cache and set the LRU data for that line to indicate that it would be a good eviction target. So if the hardware actually works that way, hopefully data from this stream will just evict previous lines from the same stream once it has an entry in each set. – Peter Cordes Jun 18 '16 at 09:52
  • @PeterCordes Which is why I didn't mark it as being a duplicate. – fuz Jun 18 '16 at 10:00
  • @FUZxxl: you did mark the other recent question (that I linked) as a dup of the benchmarking one. I'm not sure about that, but unless the OP of that question clarifies, I'm not going to vote to reopen. Anyway, none of that is relevant to this question. – Peter Cordes Jun 18 '16 at 10:02
  • If the ultimate source of your numbers is a file or the network or something else external to your program you might want to just read and process the numbers in smaller chunks. – Ross Ridge Jun 18 '16 at 13:54
  • @PeterCordes That's what my research seemed to show as well. I found an Intel person on their forums claiming that `prefetchnta` might help. Do you have any concrete sources or code that shows whether `prefetchnta` or `movntdqa` actually helps? – Rajiv Jun 18 '16 at 17:16
  • @Rajiv: no :(. Most of my knowledge is theoretical, from reading docs / manuals, not so much from practical experience tuning real stuff. And I don't remember seeing anything about speeding up streaming loads from WB memory, just stores. But like I said, my best guess for what a microarch could do on WB memory is setting cache LRU data so lines will be evicted again easily. It might be possible to test that guess somehow. – Peter Cordes Jun 19 '16 at 04:56
  • By the way, this might prove an interesting read - http://blog.stuffedcow.net/2013/01/ivb-cache-replacement . It's possible you needn't even bother with streaming loads, although of course only benchmarking could prove that. – Leeor Jul 08 '16 at 20:38
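For what it's worth, a rough sketch of the `prefetchnta` idea discussed in the comments above (purely illustrative: the prefetch distance is a guess, and whether the NTA hint actually limits cache pollution on WB memory depends on the microarchitecture):

    #include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_NTA
    #include <cstddef>

    int lookup_using_b_tree(int num);                     // from the question
    int do_some_math_that_touches_registers_only(int x);

    int sum_with_nta_prefetch(const int* stream_numbers, std::size_t count)
    {
        const std::size_t dist = 64;   // prefetch 64 ints (4 cache lines) ahead; tune this
        int result = 0;
        for (std::size_t i = 0; i < count; ++i) {
            if (i + dist < count)
                // PREFETCHNTA: request the line with a minimal-pollution hint.
                _mm_prefetch((const char*)(stream_numbers + i + dist), _MM_HINT_NTA);
            result += do_some_math_that_touches_registers_only(
                          lookup_using_b_tree(stream_numbers[i]));
        }
        return result;
    }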

1 Answer


Yes, you can use MOVNTI to store values directly to memory without them touching the cache.

The latency of a MOVNTI is about 400 cycles (on Skylake).
However, if you're just storing values, you care little about latency and much more about reciprocal throughput, which is 1 cycle per MOVNTI.

Note that you need to perform an SFENCE or MFENCE after you are done with the stores.

According to my experimentation with MOVNTI (in the context of a ZeroMem routine) it is worth the effort if you're writing more than 512 KB.
The exact values will depend critically on the cache size etc.
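A minimal sketch of such a non-temporal zeroing routine (assumptions: a 64-bit build, an 8-byte-aligned destination, and a byte count that is a multiple of 8; real code would handle a ragged head/tail with ordinary stores):

    #include <emmintrin.h>   // _mm_stream_si64 (MOVNTI), _mm_sfence
    #include <cstddef>

    // Zero 'bytes' bytes at 'dst' using non-temporal stores (x86-64 only).
    void zero_mem_nt(void* dst, std::size_t bytes)
    {
        long long* p = static_cast<long long*>(dst);
        for (std::size_t i = 0; i < bytes / 8; ++i)
            _mm_stream_si64(p + i, 0);   // MOVNTI: the stores bypass the cache
        _mm_sfence();                    // order the NT stores before later stores
    }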

The non-temporalness only applies to writes, not to reads!
In fact I don't know of any NT-mov variant that works in a non-temporal way when reading data.

However, if you are doing a read-modify-write loop, it makes little sense to use non-temporal moves.
You also need to take into account the locality of your node structure.
It likely looks like this:

    left, right: pointer_to_node   (8 bytes, aligned on a 32-byte boundary)
    data: integer                  (4 bytes)
    ...

If that is the case, reading the left/right node pointer will pull the data field in along with it, because they share the same 32-byte(*) cache line.
Doing an NT move on just the data does not help here: the data has already been sucked in when the other node fields were read, so it is already in the cache.

The fact that compilers align data structures on cache-friendly boundaries ensures that the maximum amount of node data gets hoovered into the cache with every node pointer access.

(*) cache line size is processor dependent.
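In C++ terms the layout above corresponds to something like this (names and sizes purely illustrative):

    #include <cstdint>

    // Illustrative only: the data field shares a cache line with the child
    // pointers, so touching either pointer pulls the data in as well.
    struct alignas(32) Node {
        Node*   left;   // 8 bytes
        Node*   right;  // 8 bytes
        int32_t data;   // 4 bytes
        // padded out to the 32-byte alignment
    };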

Johan
  • movnti doesn't work as a load. The only NT load instruction is [MOVNTDQA](http://www.felixcloutier.com/x86/MOVNTDQA.html). It doesn't override the ordering semantics, so I don't think it can totally skip cache. http://stackoverflow.com/a/37891933/224132 – Peter Cordes Aug 25 '16 at 20:54
  • @PeterCordes, I press save early and often, so you can safely disregard early drafts :-). As far as I understand it MOVNTDQA (when reading) is more of a future proofing rather than an actual `performs as advertised` feature. Thanks for the heads up. – Johan Aug 25 '16 at 21:10
  • @PeterCordes I don't think it's possible to completely skip the cache when reading normal memory. But from the theoretical side, there are two things that *might* be useful. `prefetchnta` and `clflushopt`. If `prefetchnta` does what it's supposed to, it won't pollute the L2 and L3 caches. `clflushopt` is new to Skylake and is a fast (relaxed ordered) version of the old `clflush`. So you `prefetchnta` into L1. Load the entire cache-line, then `clflushopt` it out. Theoretically that shouldn't touch the L2/L3 and will minimize the impact on L1. – Mysticial Aug 25 '16 at 21:19
  • @Mysticial, `movnti` will completely skip the cache when writing. however when reading I agree. Not sure what AMD's victim (i.e. non WC) L3/L2 caches do though. – Johan Aug 25 '16 at 21:21
  • @Johan Streaming writes is a solved problem since `movnti` and family do what you expect. It's just the reads that are annoying. And fundamentally it's a difficult problem since reads are blocking instructions. – Mysticial Aug 25 '16 at 21:27
  • @Johan I've tested NT-stores on an AMD Piledriver. And they also help. The catch is that the penalty for not writing out an entire cache line is massive. So you either need to be streaming so much data that splitting cache-lines at the start/end of the stream don't matter. Or you need to make sure that you always write out entire-cache lines. And only using NT-stores. – Mysticial Aug 25 '16 at 21:37
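For completeness, a sketch of the `prefetchnta` + `clflushopt` combination Mysticial describes above (needs a CPU with CLFLUSHOPT, i.e. Skylake or later; untested, and a real version would prefetch further ahead than this):

    #include <immintrin.h>   // _mm_prefetch, _mm_clflushopt (compile with -mclflushopt)
    #include <cstddef>

    int lookup_using_b_tree(int num);                     // from the question
    int do_some_math_that_touches_registers_only(int x);

    int sum_minimizing_pollution(int* stream_numbers, std::size_t count)
    {
        const std::size_t per_line = 64 / sizeof(int);    // ints per 64-byte cache line
        int result = 0;
        for (std::size_t i = 0; i < count; i += per_line) {
            // Pull the line in with the NTA hint (ideally bypassing L2/L3)...
            _mm_prefetch((const char*)(stream_numbers + i), _MM_HINT_NTA);
            std::size_t end = (i + per_line < count) ? i + per_line : count;
            for (std::size_t j = i; j < end; ++j)
                result += do_some_math_that_touches_registers_only(
                              lookup_using_b_tree(stream_numbers[j]));
            // ...then evict it once the whole line has been consumed.
            _mm_clflushopt(stream_numbers + i);
        }
        return result;
    }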