I have a single-threaded void function whose performance I care about; let's call it f. f takes as input a pointer to a float buffer of around 1.5 MB; let's call it x. f writes to another buffer, y, which is also around 1.5 MB. So to use f, we call f(x,y).
Now I run f 1000 times. In scenario one, I have ONE x and ONE y, so I do f(x,y) a thousand times. Reads of x by f are serviced from local caches and are fast.
In scenario two, I have ONE x and 1000 different y, think y0, y1 ... y999, each of which is a buffer of around 1.5 MB. (Whether they are contiguous in memory or not apparently doesn't matter.) When I do f(x,y0), f(x,y1), f(x,y2) ..., reads of x by f are no longer serviced from local caches! I observe LLC misses and get bottlenecked by DRAM latency.
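To make the two scenarios concrete, here is a minimal sketch of the access pattern. It is not my real code: the body of f is a hypothetical stand-in that just reads every element of x and writes every element of y, and run_scenario, n, calls and num_y are names made up for this illustration. The output buffers are allocated contiguously here only to keep the sketch short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for f: reads all of x, writes all of y. */
static void f(const float *x, float *y, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = 2.0f * x[i];
}

/*
 * Calls f `calls` times with a single x and `num_y` output buffers,
 * cycling through the outputs in order.
 *   num_y == 1      -> scenario one (same y every call)
 *   num_y == calls  -> scenario two (a fresh y every call)
 * Returns the last element written by the final call so the work is
 * observable, or -1.0f if allocation fails.
 */
static float run_scenario(size_t n, size_t calls, size_t num_y)
{
    assert(n > 0 && calls > 0 && num_y > 0);

    float *x  = malloc(n * sizeof *x);
    float *ys = malloc(num_y * n * sizeof *ys);  /* all y buffers, back to back */
    if (x == NULL || ys == NULL) {
        free(x);
        free(ys);
        return -1.0f;
    }

    for (size_t i = 0; i < n; ++i)
        x[i] = 1.0f;

    float *y = ys;
    for (size_t c = 0; c < calls; ++c) {
        y = ys + (c % num_y) * n;  /* pick the output buffer for this call */
        f(x, y, n);
    }

    float result = y[n - 1];
    free(x);
    free(ys);
    return result;
}
```

With n = 393216 floats (1.5 MB per buffer), scenario one corresponds to run_scenario(393216, 1000, 1) and scenario two to run_scenario(393216, 1000, 1000). Note that the second call allocates about 1.5 GB for the output buffers.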
What is going on here? I am running on a quad-core Intel Kaby Lake laptop (i5-8250U) with an L3 cache of 6144 KB.