
(My question is related to computer architecture and performance understanding. I did not find a relevant forum, so I am posting it here as a general question.)

I have a C program which accesses memory words that are located X bytes apart in virtual address space. For instance, `for (int i = 0; <some stop condition>; i += X) { array[i] = 4; }`.

I measure the execution time for varying values of X. Interestingly, when X is a power of 2 comparable to the page size, e.g., X = 1024, 2048, 4096, 8192, ..., I see a huge performance slowdown. But for all other values of X, like 1023 and 1025, there is no slowdown. The performance results are shown in the figure below.

X-axis is the value of `X`; Y-axis is execution time in milliseconds.

I tested my program on several personal machines, all running Linux on x86_64 Intel CPUs.

What could be the cause of this slowdown? We have considered the DRAM row buffer, the L3 cache, etc., but none of these seems to make sense...

Update (July 11)

We did a little test here by adding NOP instructions to the original code, and the slowdown is still there. This seems to rule out 4K aliasing; conflict cache misses are the more likely cause.

Richard
  • You're probably going to need to mention which CPU it is (an Intel Atom is very different to Netburst, and they're both very different to Sandy Bridge which is slightly different to Haswell; and even slightly different CPU models within the same micro-arch have different cache sizes), plus if the pages are populated/present, how large the array is (if it's too big for all the translations to fit in CPU's TLBs and/or some caches) and if the TLBs are "warmed". For a random guess, I'd be suspecting an aliasing condition somewhere (maybe L2 data cache?). – Brendan Jul 06 '19 at 00:47
  • What does the y-axis represent? Show the whole code used to produce the graph, how it was compiled, and on what CPU it was run. – Hadi Brais Jul 06 '19 at 09:49
  • I don't think 4k aliasing matters for pure stores, only a mix of stores and loads. A store can be written to the store buffer without figuring out whether it overlaps an earlier or later store. That's why I said in my answer here that 4k aliasing probably *doesn't* explain it, even though it's often a problem with large strides. – Peter Cordes Jul 10 '19 at 00:54
  • Yes, I agree the cause may be cache miss and will update the OP. – Richard Jul 11 '19 at 18:10
  • Adding NOPs doesn't show anything. The load and store don't have to be back-to-back for 4K aliasing to occur. It's difficult to help you if you don't provide the information I asked for in my earlier comment. – Hadi Brais Jul 11 '19 at 18:26
  • See https://stackoverflow.com/questions/57344826/how-does-cache-associativity-impact-performance – Rick James Aug 17 '19 at 21:42

1 Answer


There are two things here:

  • Set-associative cache aliasing, creating conflict misses if you only touch multiple-of-4096 addresses. The inner fast caches (L1 and L2) are normally indexed by a small range of bits from the physical address. Striding by 4096 bytes means those address bits are the same for all accesses, so you're only using one of the sets in L1d cache, and some small number in L2.

    Striding by 1024 means you'd only be using 4 sets in L1d, with smaller powers of 2 using progressively more sets, but a non-power-of-2 stride distributing over all the sets. (Intel CPUs have used 32KiB 8-way associative L1d caches for a long time; 32K/8 = 4K per way. Ice Lake bumped it up to 48K 12-way, keeping the property that the set index depends only on bits below the page offset. That's not a coincidence for VIPT caches that want to index in parallel with the TLB.)

    But with a non-power-of-2 stride, your accesses will be distributed over more sets in the cache. Performance advantages of powers-of-2 sized data? (answer describes this disadvantage)

    Which cache mapping technique is used in intel core i7 processor? - shared L3 cache is resistant to aliasing from big power-of-2 offsets because it uses a more complex indexing function.

  • 4k aliasing (e.g. in some Intel CPUs). Although with only stores, this probably doesn't matter. It's mainly a factor for memory disambiguation, when the CPU has to quickly figure out whether a load might be reloading recently-stored data, and it does so in the first pass by looking just at page-offset bits.

    This is probably not what's going on for you, but for more details see:
    L1 memory bandwidth: 50% drop in efficiency using addresses which differ by 4096+64 bytes and
    Why are elementwise additions much faster in separate loops than in a combined loop?
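To make both kinds of aliasing concrete, here's a small sketch of the index arithmetic, assuming the 32 KiB / 8-way / 64-byte-line L1d geometry mentioned above (64 sets, so `set = (addr >> 6) & 63`), and the low-12-bit comparison the memory-disambiguation first pass uses. The helper names are mine, not any real API:

```c
#include <stdint.h>

#define LINE_BITS 6    /* 64-byte cache lines */
#define NSETS     64   /* 32 KiB / 8 ways / 64 B per line */

/* L1d set index: bits 6..11 of the address (all below the 4K page offset). */
static unsigned l1d_set(uintptr_t addr)
{
    return (addr >> LINE_BITS) & (NSETS - 1);
}

/* The bits the store-forwarding first pass compares for 4K aliasing. */
static unsigned page_offset(uintptr_t addr)
{
    return addr & 0xFFF;
}

/* Count distinct L1d sets touched by n accesses with the given byte stride. */
static unsigned distinct_sets(uintptr_t base, long stride, int n)
{
    int seen[NSETS] = {0};
    unsigned count = 0;
    for (int i = 0; i < n; i++) {
        unsigned s = l1d_set(base + (uintptr_t)i * stride);
        if (!seen[s]) { seen[s] = 1; count++; }
    }
    return count;
}
```

With a 4096-byte stride every access hits the same set; a 1024-byte stride uses 4 sets; a 1023-byte stride eventually spreads over all 64. Addresses exactly 4096 apart also share identical page-offset bits, which is the 4K-aliasing condition.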

Either or both of these effects could be a factor in Why is there huge performance hit in 2048x2048 versus 2047x2047 array multiplication?


Another possible factor is that HW prefetching stops at physical page boundaries. Why does the speed of memcpy() drop dramatically every 4KB? But changing a stride from 1024 to 1023 wouldn't help that by a big factor. "Next-page" prefetching in IvyBridge and later is only TLB prefetching, not data from the next page.
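A quick toy calculation (not a measurement) shows why changing the stride from 1024 to 1023 wouldn't reduce page-boundary crossings:

```c
/* Count how many distinct 4 KiB pages a strided walk over `size` bytes
 * touches. Accesses are in increasing address order, so a new page is
 * entered exactly when the page number changes. */
static long pages_touched(long size, long stride)
{
    long pages = 0, last_page = -1;
    for (long i = 0; i < size; i += stride) {
        long page = i >> 12;   /* 4 KiB page number */
        if (page != last_page) { pages++; last_page = page; }
    }
    return pages;
}
```

Over a 1 MiB buffer, strides of 4096, 1024, and 1023 all touch every one of the 256 pages, so a prefetcher that stops at page boundaries hits the same number of stops either way.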


I kind of assumed x86 for most of this answer, but the cache aliasing / conflict-miss stuff applies generally. Set-associative caches with simple indexing are universally used for L1d caches. (Or on older CPUs, direct-mapped where each "set" only has 1 member). The 4k aliasing stuff might be mostly Intel-specific.

Prefetching across virtual page boundaries is likely also a general problem.

Peter Cordes
  • There is no temporal locality in the loop, so I think cache set mapping hardly matters (except when the size of the array is small enough to fit in the private caches, but large enough for writebacks to potentially impact performance by causing congestion at the L2 and L3 ports). There are apparently no loads from the array, so 4K aliasing cannot occur. The expected impact from hardware prefetching and TLB misses is not consistent with the graph from the OP. Although I suspect that the code the OP has shown is not the one used to produce the graph. – Hadi Brais Jul 06 '19 at 09:56
  • @HadiBrais: I was assuming this was inside a repeat loop, but that might have been a bad assumption. I wonder if there could be some kind of impact from stores aliasing the same set as other fairly-recent stores. Like if somehow we ended up having to wait for write-back when we wouldn't otherwise? Or maybe early eviction of some loads done by HW prefetch so they have to be redone before store commit? That could cost a lot of time. – Peter Cordes Jul 06 '19 at 10:12
  • L1 writebacks have lower priority than other transactions, so they are done when the bus is available. However, when the L1D writeback buffers are full, writebacks get higher priority, which can delay other transactions (loads, RFOs, write-through stores). The `OFFCORE_REQUESTS_BUFFER.SQ_FULL` event can be used to count the number of cycles on which this situation occurs. The number of L1 writeback buffers is 1 on the Pentium Pro and has increased to 4 at some point, and then to 6 later (on Haswell/Broadwell/Skylake). Note that writebacks take the same path as other transactions. – Hadi Brais Jul 06 '19 at 10:31
  • On processors with a non-inclusive L3, dirty lines in the L2 are likely to be written back to the L3, but if the line is no longer in the L3, it has to be fetched from its home node, which costs in terms of memory bandwidth, L3 bandwidth, and L3 space. – Hadi Brais Jul 06 '19 at 10:36
  • Why do you assume the next-page prefetcher is for the TLB only? – Leeor Jul 06 '19 at 20:00
  • @Leeor: because apparently that's what it is: a TLB prefetcher, not a page-crossing data prefetcher. Unless I've been misinformed by people on SO (e.g. @Brendan) who corrected me when I assumed from the name that it was a data prefetcher. – Peter Cordes Jul 06 '19 at 20:02