Incrementing every byte of `int8_t aArr[SIZE_L3];` separately is slow enough that hardware prefetchers can probably keep up most of the time. Out-of-order execution can keep a lot of read-modify-writes in flight at once to different addresses, but the best case is still one byte of stores per clock. (The bottleneck is store-port uops, assuming this was a single-threaded test on a system without much other demand for memory bandwidth.)
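For concreteness, here's a minimal sketch of the kind of loop under discussion. `SIZE_L3` = 4MiB is an assumption (it's what the 65536-lines calculation below implies), and this assumes the compiler emits scalar byte stores rather than auto-vectorizing:

```c
#include <stddef.h>
#include <stdint.h>

#define SIZE_L3 (4 * 1024 * 1024)  /* assumed: 4MiB, matching the math below */
int8_t aArr[SIZE_L3];

/* Every-byte version: one byte store per element. With scalar code on a
 * CPU with one store port, the best case is one store (one byte) per clock,
 * which is slow enough for HW prefetch to mostly keep up. */
void inc_every_byte(void) {
    for (size_t i = 0; i < SIZE_L3; i++)
        aArr[i]++;
}
```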
Intel CPUs have their main prefetch logic in the L2 cache (as described in Intel's optimization guide; see the x86 tag wiki). So a successful hardware prefetch into L2 before the core issues a load means that the L3 cache never sees a miss for that line.
John McCalpin's answer on this Intel forum thread confirms that L2 hardware prefetches are NOT counted as LLC references or misses by the normal perf events like `MEM_LOAD_UOPS_RETIRED.LLC_MISS`. Apparently there are `OFFCORE_RESPONSE` events you can look at instead.
IvyBridge introduced next-page hardware prefetch. Intel microarchitectures before that don't cross 4k page boundaries when prefetching, so you still get misses every 4k, and maybe TLB misses if the OS didn't opportunistically put your memory in a 2MiB hugepage. (But speculative page walks as you approach a page boundary can probably avoid much of the delay there; hardware definitely does do speculative page walks.)
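On Linux, if you don't want to rely on transparent hugepages happening opportunistically, you can hint for them explicitly. A sketch, not something from the original test; `madvise(MADV_HUGEPAGE)` is Linux-specific and only a hint:

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: anonymous mapping with an explicit transparent-
 * hugepage hint, so a 4MiB array can live in two 2MiB pages instead of
 * 1024 4k pages (fewer page crossings for the prefetchers, fewer TLB misses). */
void *alloc_huge(size_t size) {
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    madvise(p, size, MADV_HUGEPAGE);  /* hint only; the kernel may ignore it */
    return p;
}
```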
With a stride of 64 bytes, execution can touch memory much faster than the cache / memory hierarchy can keep up, so you bottleneck on L3 / main memory. Out-of-order execution can keep about the same number of read-modify-write ops in flight at once, but the same out-of-order window now covers 64x more memory.
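Presumably the strided version looked something like this (same hypothetical `aArr` / `SIZE_L3` as above, touching one byte per 64-byte cache line):

```c
/* Stride-64 version: one RMW per cache line instead of per byte, so the
 * same out-of-order window demands 64x more memory bandwidth and the
 * prefetchers / DRAM can't keep up. */
void inc_stride64(void) {
    for (size_t i = 0; i < SIZE_L3; i += 64)
        aArr[i]++;
}
```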
**Explaining the exact numbers in more detail**
For array sizes right around L3, IvyBridge's adaptive replacement policy probably makes a significant difference.
Until we know the exact uarch and more details of the test, I can't say for sure. It's not clear if you ran that loop only once, or if you had an outer repeat loop and those miss / reference numbers are an average per iteration.
If it's only from a single run, that's a tiny noisy sample. I assume it was somewhat repeatable, but I'm surprised the L3 reference count was so high for the every-byte version. `4 * 1024^2 / 64 = 65536`, so there was still an L3 reference for most of the cache lines you touched.
Of course, if you didn't have a repeat loop and those counts include everything the code did besides the loop, maybe most of those counts came from startup / cleanup overhead in your program. (i.e. your program with the loop commented out might still show ~48k L3 references, IDK.)
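An outer repeat loop is the easy fix if you didn't have one: run the inner loop enough times that startup / cleanup counts become noise, then divide the perf counts by the repeat count. A sketch, reusing the hypothetical `aArr` from above (`REPS` is arbitrary):

```c
#define REPS 100  /* arbitrary; big enough that startup overhead is negligible */

void repeated_test(void) {
    for (int rep = 0; rep < REPS; rep++)
        for (size_t i = 0; i < SIZE_L3; i++)
            aArr[i]++;
    /* Stores to a non-static global are externally visible, but check the
     * asm to make sure the compiler didn't collapse the two loops into
     * a single aArr[i] += REPS pass. */
}
```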
> I have tested this with a dynamically allocated array
Totally unsurprising, since it's still contiguous.
> and on a CPU that has a larger L3 cache (8MB) and I get a similar pattern in the results.
Did this test use a larger array? Or did you use a 4MiB array on a CPU with an 8MiB L3?