I am trying to optimize the performance of my image-processing code. One example is unsharp masking: it applies a calculation to a square region around each pixel of the image, in raster order.

I want to check whether copying several lines of the image to a dedicated "work area", while bypassing the cache, will help. The idea is that data from the image will not evict other useful data from the cache, which should improve performance.

How can I implement a special form of `memcpy` that doesn't update the cache?

I don't use OpenCV, but if it has such support, I am ready to try it.

I don't want to mark the whole image as an uncached area, because I have many algorithms running on it, and I want to measure the effect of my optimization attempt on just one algorithm.

anatolyg
  • Hi, this might be of interest https://stackoverflow.com/questions/9544094/how-to-mark-some-memory-ranges-as-non-cacheable-from-c – jspcal Mar 01 '19 at 23:26
  • @jspcal: you don't actually want x86 uncacheable memory regions, unless maybe you have AVX512 so you can still read the whole cache line with one DDR burst. (Every separate access to uncacheable memory results in a separate actual access to DRAM, so two 32-byte loads would be horrible.) – Peter Cordes Mar 01 '19 at 23:49
  • I've done a similar thing for convolving an image with a smaller "image" that corrects for various effects in a scanner. Generally, the best performance seems to be to copy square portions of the main image into a smaller array so you stay within the L caches. The subimages have to overlap as needed based on the convolved image size. This approach works really well multithreaded too when the full image is > 20MB. A simple approach is to treat each of the RGB planes separately and run 3 threads on them. – doug Mar 02 '19 at 00:17
  • @doug: That's related to "loop tiling", aka cache blocking, but in tiling you don't actually copy, just access a square portion with a row stride wider than the width. (Works well unless the size is a large power of 2 and the rows alias each other in cache.) – Peter Cordes Mar 02 '19 at 07:14
  • @PeterCordes Yep. Works well in cases like the OP's where the convolving matrix is not too wide. My situation was a bit unusual in that the convolving matrix was fairly large because it needed to correct for large area crosstalk, a problem with scanners. copying the entire tile to a separate memory area for each thread then convolving sped things up significantly. At least for images over about 20MB. It was a rather unusual situation. Your approach is likely better for the OP unless the size was far larger. – doug Mar 02 '19 at 15:57
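
As a rough sketch of the loop-tiling / cache-blocking approach described in the comments above (not code from the thread): process the image in square tiles so each tile's working set stays within the caches, without copying anything. The tile size `TILE` and the per-pixel operation are made-up placeholders.

```c
#include <stddef.h>

#define TILE 128  /* tile side in pixels; made-up value, tune to cache size */

/* Hypothetical sketch of cache blocking: visit the image tile by tile,
   accessing each tile in place with the full row stride. */
static void process_tiled(unsigned char *img, size_t width, size_t height,
                          size_t stride)
{
    for (size_t ty = 0; ty < height; ty += TILE)
        for (size_t tx = 0; tx < width; tx += TILE)
            for (size_t y = ty; y < ty + TILE && y < height; y++)
                for (size_t x = tx; x < tx + TILE && x < width; x++)
                    img[y * stride + x] = 255 - img[y * stride + x];  /* placeholder per-pixel work */
}
```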

1 Answer

The way to do exactly what you want is to use the MOVNTDQA instruction in conjunction with the WC memory type. This reads from memory into a streaming load buffer instead of into the cache. Subsequent streaming loads from the same line are supplied from the streaming load buffer. See section 12.10.3 in volume 1 of the SDM. This instruction was added with SSE4.1.
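
As a rough illustration (not part of the original answer), a streaming-load copy might look like the sketch below. It assumes the source buffer is mapped with the WC memory type (on ordinary WB memory the streaming load behaves like a normal load, as the comments note), that both pointers are 16-byte aligned, and that the size is a multiple of 16 bytes; the helper name `copy_streaming` is made up.

```c
#include <emmintrin.h>  /* SSE2: _mm_store_si128 */
#include <smmintrin.h>  /* SSE4.1: _mm_stream_load_si128 (MOVNTDQA) */
#include <stddef.h>

/* Hypothetical helper: copy `size` bytes from a WC-mapped source into an
   ordinary, cacheable work area using streaming loads. Assumes 16-byte
   aligned pointers and a size that is a multiple of 16. */
static void copy_streaming(void *dst, const void *src, size_t size)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    for (size_t i = 0; i < size / 16; i++) {
        /* MOVNTDQA: read 16 bytes through the streaming load buffer,
           bypassing the cache (only special on WC memory). */
        __m128i v = _mm_stream_load_si128((__m128i *)&s[i]);
        /* Plain store into the cacheable work area. */
        _mm_store_si128(&d[i], v);
    }
}
```

Reading a full 64-byte line per iteration (four such loads) before storing tends to make better use of the streaming load buffers; the first Intel article linked below covers that pattern.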

Additional references:
https://software.intel.com/en-us/articles/copying-accelerated-video-decode-frame-buffers
https://www.embedded.com/print/4007238
(Note, I haven't read these thoroughly, so I don't know how useful they are.)

Note that MOVNTDQA is not ordered with respect to writes from other cores, but based on your description that doesn't seem to be a concern in your situation.

You definitely don't want to use the UC memory type because, as Peter mentioned, each access results in a separate DRAM read; to make it even worse, UC accesses are serializing, destroying any parallelism in your code.

prl
  • Important to mention that on current CPUs, `movntdqa` loads are *only* special on WC memory. On normal WB memory, it's just a more expensive `movdqa` (may include an ALU uop.) If you have the same physical pages mapped twice, with WC and WB, you're all set to try it. (Unless the memory is *only* ever read with streaming reads+writes, then you only need WC. Which you still can't ask for with `mmap` or anything). Otherwise your main option AFAIK is `prefetchnta` + regular loads to do a pollution-minimizing prefetch. But SW prefetch tuning parameters can be brittle and system-dependent. – Peter Cordes Mar 02 '19 at 07:11
  • To anyone who might wonder about the acronyms: WC = write-combining, WB = write-back, UC = uncached. – anatolyg Jun 03 '20 at 08:02
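
For completeness, here is a rough sketch (not from the thread) of the NT-prefetch alternative Peter Cordes describes above: issue `prefetchnta` a little ahead of ordinary loads on normal WB memory to limit cache pollution. The helper `copy_row_nta` and the prefetch distance `PREFETCH_AHEAD` are made-up names and values, the size is assumed to be a multiple of 64 bytes, and, as the comment warns, this kind of tuning is brittle and system-dependent.

```c
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_NTA */
#include <string.h>     /* memcpy */
#include <stddef.h>

#define PREFETCH_AHEAD 256  /* bytes ahead of the copy; made-up value, needs tuning */

/* Hypothetical sketch: copy one image row into the work area with
   pollution-minimizing prefetches. Assumes size is a multiple of 64. */
static void copy_row_nta(char *dst, const char *src, size_t size)
{
    for (size_t off = 0; off < size; off += 64) {
        /* PREFETCHNTA: hint that a line we'll need soon is non-temporal,
           so it pollutes the outer cache levels as little as possible. */
        if (off + PREFETCH_AHEAD < size)
            _mm_prefetch(src + off + PREFETCH_AHEAD, _MM_HINT_NTA);
        /* Ordinary 64-byte copy of the current chunk. */
        memcpy(dst + off, src + off, 64);
    }
}
```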