I am working on an application that has quite a few internal data structures, but also processes huge amounts of user data. During this processing, I need to have the CPU look at the data just once (the rest of the processing is done via zero copies and DMA, so the CPU need not touch the data at all).

I am searching for a way to process the user data (even if it means copying it to a temporary buffer) without having it evict the internal structures from the CPU's data cache. In other words, I'm looking for a way to tell the CPU "give me this data, but I'm never going to need it again".

I seem to recall that gcc has an intrinsic for this, but going over the list, it appears I misremembered (or simply couldn't find it). Either way, an assembly solution (Intel x86) would work fine for my purposes.

Logic suggests there must be a way to do this, as it is necessary before sending data to (or receiving it from) DMA buffers.

Shachar Shemesh
    SSE4.1 [MOVNTDQA](http://www.felixcloutier.com/x86/MOVNTDQA.html) is a load with a non-temporal *hint* which may give you that behaviour even on normal "write-back" memory regions. It's not guaranteed to behave any differently from a normal vector load, though. Using it to read your data directly, or to copy into a small buffer if you want to use non-vector code on it, should be better than nothing. (I recommend using the C intrinsic for MOVNTDQA, not writing inline asm.) – Peter Cordes Nov 23 '16 at 07:15
  • [`__builtin_prefetch`](http://stackoverflow.com/q/10323420/995714)? – phuclv Nov 23 '16 at 07:19
  • Also, if you're running on Intel IvyBridge or newer, [its L3 cache uses an adaptive replacement policy to minimize cache pollution from looping over huge amounts of memory](http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/). Either way, unless you use a memory type other than WB (e.g. WC, which is weakly-ordered), your data will have to go into cache, but hopefully it can only pollute one "way" of each set in the L3, L2, and L1 caches. – Peter Cordes Nov 23 '16 at 07:19
  • [How to write or read memory without touching cache](http://stackoverflow.com/q/28684812/995714), [What is the meaning of “non temporal” memory accesses in x86](http://stackoverflow.com/q/37070/995714), [How can I load values from memory without polluting the cache?](http://stackoverflow.com/q/1265469/995714) – phuclv Nov 23 '16 at 07:20
  • @LưuVĩnhPhúc: that's a good point, NT prefetch should hopefully do whatever is possible to minimize cache pollution while still making it ready. It's harder to tune than MOVNTDQA, though. – Peter Cordes Nov 23 '16 at 07:20
  • An excellent discussion of this topic by Ulrich Drepper: https://lwn.net/Articles/255364/ – David Nov 23 '16 at 07:27
  • @LưuVĩnhPhúc: most of those answers are actually talking about stores, and pretty much ignore NT loads (because SSE4.1 MOVNTDQA loads are still strongly-ordered on WB memory, and AFAIK there isn't much info on exactly how different microarchitectures implement the NT hint). [This is what I've been able to find](http://stackoverflow.com/questions/32103968/non-temporal-loads-and-the-hardware-prefetcher-do-they-work-together), with links to other stuff for more details on various aspects. AFAIK, PREFETCHNTA and MOVNTDQA might help, but there's no guarantee other than changing memory type. – Peter Cordes Nov 23 '16 at 07:30
  • It is controlled by the PCD bit in the page directory entry (Page-level Cache Disable). What knob you'll need to tell your OS about it is hard to guess. On MSVC++ it can be done with #pragma section, using the nocache attribute. Gcc/posix ought to have something similar; I'd guess at the ld script. – Hans Passant Nov 23 '16 at 07:59
  • @PeterCordes, I don't have a problem with the data going into the cache, as long as it doesn't overstay its welcome. I mostly want it not to push my internal data structures out. – Shachar Shemesh Nov 23 '16 at 14:24
  • Yeah, that's what I thought. You should try with/without PREFETCHNTA or MOVNTDQA, and see if it affects the cache miss-rate for the data that does get reused and should ideally be staying hot in cache. – Peter Cordes Nov 23 '16 at 22:08
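The `__builtin_prefetch` approach suggested in the comments can be sketched as follows. This is a minimal, illustrative example, not a tuned implementation: the function name, the 64-byte line stride, and the 256-byte prefetch distance are all assumptions to be measured against your workload (e.g. with cache-miss counters), as Peter Cordes suggests above. On x86, GCC compiles `__builtin_prefetch(p, 0, 0)` (read, no temporal locality) to PREFETCHNTA, which asks the CPU to fetch the line with minimal cache pollution.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum a large buffer the CPU will touch exactly once, hinting that the
 * data has no temporal locality.  __builtin_prefetch(addr, rw, locality)
 * with rw=0 (read) and locality=0 (non-temporal) maps to PREFETCHNTA on
 * x86.  Prefetching past the end of the buffer is safe: prefetch
 * instructions never fault.  Stride and distance are illustrative. */
static uint64_t sum_once(const uint8_t *buf, size_t len)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        if ((i & 63) == 0)                    /* once per 64-byte cache line */
            __builtin_prefetch(buf + i + 256, /* a few lines ahead           */
                               0,             /* 0 = for reading             */
                               0);            /* 0 = no temporal locality    */
        sum += buf[i];
    }
    return sum;
}
```

Note that this is only a hint: with WB memory the lines still enter the cache, but (hopefully) in a way that limits how many ways of each set they can evict. The SSE4.1 alternative is `_mm_stream_load_si128` (the MOVNTDQA intrinsic from `<smmintrin.h>`), which likewise has no guaranteed effect on WB memory.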

0 Answers