
I have a single-threaded void function whose performance I care about; let's call it f. f takes as input a pointer to a float buffer of size around 1.5 MB, let's call it x. f writes to another buffer, let's call it y, which also has size around 1.5 MB. So to use f, we call f(x,y).

Now I run f 1000 times. In scenario one, I have ONE x and ONE y, so I do f(x,y) a thousand times. Reads of x by f are serviced from local caches and are fast.

In scenario two, I have ONE x and 1000 different y buffers, think y0, y1, ..., y999, each of which is a buffer of size around 1.5 MB. (Whether they are contiguous in memory or not doesn't seem to matter.) When I do f(x,y0), f(x,y1), f(x,y2), ..., reads of x by f are no longer serviced from local caches! I observe LLC misses and get bottlenecked by DRAM latency.
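
To make the setup concrete, here is a rough sketch of how the two scenarios are structured (the body of f below is only a placeholder standing in for the real computation; buffer sizes are approximate):

```c
#include <stddef.h>

#define NFLOATS (1500u * 1024u / sizeof(float))  /* ~1.5 MB of floats per buffer */
#define NCALLS  1000

/* Placeholder for the real f: single-threaded, reads x, writes y. */
static void f(const float *x, float *y) {
    for (size_t i = 0; i < NFLOATS; ++i)
        y[i] = x[i] * 2.0f;
}

/* Scenario one: ONE x and ONE y. Together they are ~3 MB, which fits
   in the 6 MB L3, so after the first call reads of x hit in cache. */
static void scenario_one(const float *x, float *y) {
    for (int i = 0; i < NCALLS; ++i)
        f(x, y);
}

/* Scenario two: ONE x and 1000 distinct y buffers. Every call writes
   ~1.5 MB of fresh data, and reads of x start missing the LLC. */
static void scenario_two(const float *x, float *y[NCALLS]) {
    for (int i = 0; i < NCALLS; ++i)
        f(x, y[i]);
}
```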

What is going on here? I am running this on an Intel Kaby Lake quad-core laptop (i5-8250U) with a 6144 KB (6 MB) L3 cache.

bumpbump
  • *Reads of x by f are serviced from local caches and are fast* - not private caches, those are only 256k L2 per core. The loads would have to go off-core (still on-chip) over the ring bus to the L3 cache. The L3 cache is fairly fast, but IDK if I'd call it "local". It's the least local CPU cache, being shared by all cores. – Peter Cordes Aug 19 '20 at 05:27
  • Re: scenario 2: cache pollution from stores is a real thing. But if you're just measuring perf events and seeing loads, keep in mind that stores need to RFO (read for ownership); they don't AFAIK detect a full-line write and optimize to just invalidate. And of course evicting dirty `y0` lines from L3 while storing `y1` or `y2` costs DRAM bandwidth. Manually using NT stores could help to avoid RFOs (see the sketch after these comments). See [Enhanced REP MOVSB for memcpy](https://stackoverflow.com/q/43343231) - your test with one read stream and 1 write stream is very similar to memcpy. – Peter Cordes Aug 19 '20 at 05:30
  • @PeterCordes can you explain your second comment more? Thanks a lot! – bumpbump Aug 19 '20 at 05:38
  • Did you read [Enhanced REP MOVSB for memcpy](https://stackoverflow.com/q/43343231)? It is the fuller explanation for that comment; that's why I linked it. – Peter Cordes Aug 19 '20 at 05:40
  • So I am pretty sure the DRAM loads are actually for x... I used Intel VTune's line-by-line asm profiler. – bumpbump Aug 19 '20 at 05:45
  • If you have more detailed info, put it in your question so people can use it while coming up with explanations! However, note that perf events don't always get attributed to the correct instruction. (And some aren't tied to instructions at all, e.g. some off-core events just happen without having any specific instruction to blame.) But yeah if you're seeing counts for `mem_load_retired.l3_miss` that's probably the loads missing, not stores, regardless of which instruction it's on. And could indicate that your stores to `y0...999` are sometimes evicting `x`. – Peter Cordes Aug 19 '20 at 06:03
  • IvyBridge and later are supposed to be somewhat resistant to that kind of cache pollution, with an adaptive replacement policy. But it's likely it doesn't work perfectly. (http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/). Semi-related general stuff: [Which cache mapping technique is used in intel core i7 processor?](https://stackoverflow.com/q/49092541) – Peter Cordes Aug 19 '20 at 06:08
  • I suppose a different replacement policy is why storing to y0 ... y999 is somehow worse than just storing to y0 1000 times? – bumpbump Aug 19 '20 at 06:09
  • No, regardless of replacement policy, storing to memory that's already hot in cache is always obviously better because it doesn't require any DRAM writes. (All caches in modern x86 CPUs are write-back.) A clever replacement policy can just make it somewhat less bad by not evicting data that does have some future value. – Peter Cordes Aug 19 '20 at 06:13
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/220068/discussion-between-bumpbump-and-peter-cordes). – bumpbump Aug 19 '20 at 06:14
  • At least, provide the relevant assembly code snippets and include the region of interest and any other parts that allocate or touch the buffers. If you want to show high-level code, provide the versions of the tools used to compile it. Also mention exactly how you're measuring performance events and which events are being measured. – Hadi Brais Aug 21 '20 at 23:33
  • Provide an MCVE to respect the answerer's time and have a much better chance at an accurate answer. – BeeOnRope Aug 22 '20 at 22:20
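
For reference, the non-temporal-store suggestion from the comments would look roughly like the sketch below. `f_nt`, its trivial computation, and the size parameter are hypothetical, and the sketch assumes y is 32-byte aligned, the element count is a multiple of 8, and every element of y gets written:

```c
#include <immintrin.h>  /* AVX intrinsics; compile with -mavx */
#include <stddef.h>

/* Hypothetical variant of f whose stores to y are non-temporal, so
   they bypass the caches (no RFO, no L3 pollution) and the 1000
   distinct y buffers stop evicting x. */
void f_nt(const float *x, float *y, size_t n) {
    for (size_t i = 0; i < n; i += 8) {
        __m256 v = _mm256_loadu_ps(x + i);           /* reads of x can stay cached */
        v = _mm256_mul_ps(v, _mm256_set1_ps(2.0f));  /* placeholder for the real work */
        _mm256_stream_ps(y + i, v);                  /* NT store: write-combining, skips L1/L2/L3 */
    }
    _mm_sfence();  /* order the NT stores before y is read elsewhere */
}
```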

0 Answers