I am trying to demonstrate cache associativity on Intel CPUs by accessing memory with a stride equal to a power of two. I access a K x N matrix of doubles column-wise (in C++), and expect that for K > 8 the performance will degrade if N is a power of two (because I'm working on CPUs with 8-way set-associative L1 caches). I therefore plot the performance of memory access for K=1..40, for N=2^20, N=2^20+8 and N=2^20+64, giving 0, 1 and 8 cache lines of padding, respectively.
#include <cstdio>
#include <cstdlib>
#include <chrono>

double buffer[((1 << 20) + 64) * 40];

void measure_flops(int N, int K) {
    // Warm-up.
    for (int j = 0; j < 10; ++j)
        for (int i = 0; i < N * K; ++i)
            buffer[i] += 1e-10 * i + 0.123 * j;

    // Iterate. Traverse the K x N matrix column by column.
    int repeat = 500 / K;
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeat; ++r)
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < K; ++j)
                buffer[j * N + i] += 1.0123;
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double> diff = end - start;

    // Read back a random element so the loop cannot be optimized away.
    volatile double tmp = buffer[rand() % (N * K)]; (void)tmp;

    // Report.
    double flops = (double)repeat * N * K / diff.count();
    printf("%d %2d %.4lf\n", N, K, flops * 1e-9);
    fflush(stdout);
}

void run(int N) {
    printf("  N  K  GFLOPS\n");
    for (int K = 1; K <= 40; ++K)
        measure_flops(N, K);
    printf("\n\n");
}

int main() {
    const int N = 1 << 20;
    run(N);                         // 0 B padding
    run(N + 64 / sizeof(double));   // 64 B padding (1 cache line)
    run(N + 512 / sizeof(double));  // 512 B padding (8 cache lines)
    return 0;
}
I run the benchmark on the following CPUs:
- Intel Xeon E5-2680 v3 (Haswell, associativity is 8, 8, 20 for L1, L2 and L3, respectively)
- Intel Xeon E5-2670 v3 (Haswell, associativity is 8, 8, 20)
- Intel Xeon Gold 6150 (Skylake, associativity is 8, 16, 11)
...and I get inconsistent results and certain shapes I don't understand.
My questions are:
1) Why did the performance not recover with 64 B padding on the E5-2680, when it did on the E5-2670? Performance does recover with 512 B padding, but why? (Is it possible that the prefetcher thrashes the cache?)
EDIT: Tested on E5-2690 and E5-2650 too, on another cluster. E5-2690 behaves like the E5-2680 (bad performance with 64 B padding), while E5-2650 behaves like the E5-2670 (good performance). So it seems something really did change between the 2670 and the 2680.
2) Why does the performance drop so early, before K reaches 8? (I suspect it is the "bogus store forwarding stall" as explained in a somewhat different setup here, but I'm not sure.)
3) Why does the performance drop again at K≈32 for the E5-2680 and at K≈35 for the Gold 6150?
(perf stat -ddd shows an increased number of L3 cache misses there, but I don't see the exact reason why, as K does not match the L3 associativity.)
Setup details:
The E5-2680 and the Gold 6150 both run in a cluster, where I take a full node to make sure there is no interference from other users. A node contains 2 CPUs in both cases, but I use only a single core anyway. The E5-2670 is on a single-CPU shared machine (I cannot get a full node there), but I ran the benchmark while it was idle.
I compiled the code with g++ -O1 -std=c++11 (version 6.3.0 for the E5-2680 and the Gold, version 7.3.0 for the E5-2670). All results are reproducible. Changing the compiler (or, for example, adding -O3, or adding volatile to buffer) changes more or less only the results for low K, e.g. K=2 gets faster or slower.