Consider the following programs:
#include <stdio.h>
#include <stdlib.h>
typedef unsigned long long u64;
int program_1(u64* a, u64* b)
{
const u64 lim = 50l * 1000l * 1000l;
// Reads arrays
u64 sum = 0;
for (u64 i = 0; i < lim * 100; ++i) {
sum += a[i % lim];
sum += b[i % lim];
}
printf("%llu\n", sum);
return 0;
}
int program_2(u64* a, u64* b)
{
const u64 lim = 50l * 1000l * 1000l;
// Reads arrays
u64 sum = 0;
for (u64 i = 0; i < lim * 100; ++i) {
sum += a[i % lim];
}
for (u64 i = 0; i < lim * 100; ++i) {
sum += b[i % lim];
}
printf("%llu\n", sum);
return 0;
}
Both programs are identical: they fill up an array with 1s, then read every element 100 times, adding to a counter. The only difference is the first one fuses the adder loop, while the second one performs two separate passes. Given that M1 has a 64KB of L1 data cache, my understanding is that the following would happen:
Program 1
sum += a[0] // CACHE MISS. Load a[0..8192] on L1.
sum += b[0] // CACHE MISS. Load b[0..8192] on L1.
sum += a[1] // CACHE MISS. Load a[0..8192] on L1.
sum += b[1] // CACHE MISS. Load b[0..8192] on L1.
sum += a[2] // CACHE MISS. Load a[0..8192] on L1.
sum += b[2] // CACHE MISS. Load b[0..8192] on L1.
(...)
Program 2
sum += a[0] // CACHE MISS. Load a[0..8192] on L1.
sum += a[1] // CACHE HIT!
sum += a[2] // CACHE HIT!
sum += a[3] // CACHE HIT!
sum += a[4] // CACHE HIT!
...
sum += a[8192] // CACHE MISS. Load a[8192..16384] on L1.
sum += a[8193] // CACHE HIT!
sum += a[8194] // CACHE HIT!
sum += a[8195] // CACHE HIT!
sum += a[8196] // CACHE HIT!
...
...
sum += b[0] // CACHE MISS. Load b[0..8192] on L1.
sum += b[1] // CACHE HIT!
sum += b[2] // CACHE HIT!
sum += b[3] // CACHE HIT!
sum += b[4] // CACHE HIT!
...
This would lead me to believe that the first program is slower, since every read is a cache miss, while the second one consists majorly of cache hits. The results, though, differ. Running on a Macbook Pro M1, with clang -O2
, the first program takes 2.8s to complete, while the second one takes about 3.8s.
What is wrong about my mental model of how the L1 cache works?