
I am trying to demonstrate cache associativity on Intel CPUs by accessing memory with a stride equal to a power of two. I access a K x N matrix of doubles in column-wise fashion (in C++) and expect that for K > 8 performance will degrade when N is a power of two (because I'm working on CPUs with an 8-way set-associative L1 cache). I therefore plot the performance of memory access for K=1..40 and for N=2^20, N=2^20+8 and N=2^20+64, i.e. a padding of 0, 1 and 8 cache lines, respectively.

#include <cstdio>
#include <cstdlib>
#include <chrono>

double buffer[((1 << 20) + 64) * 40];

void measure_flops(int N, int K) {
    // Warm-up.
    for (int j = 0; j < 10; ++j)
        for (int i = 0; i < N * K; ++i)
            buffer[i] += 1e-10 * i + 0.123 * j;

    // Iterate.
    int repeat = 500 / K;  // keep the total amount of work roughly independent of K
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeat; ++r)
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < K; ++j)
                buffer[j * N + i] += 1.0123;
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double> diff = end - start;

    // Read one element back through volatile so the compiler cannot optimize the updates away.
    volatile double tmp = buffer[rand() % (N * K)]; (void)tmp;

    // Report.
    double flops = (double)repeat * N * K / diff.count();
    printf("%d  %2d  %.4lf\n", N, K, flops * 1e-9);
    fflush(stdout);
}

void run(int N) {
    printf("      N   K  GFLOPS\n");
    for (int K = 1; K <= 40; ++K)
        measure_flops(N, K);
    printf("\n\n");
}

int main() {
    const int N = 1 << 20;

    run(N);
    run(N + 64 / sizeof(double));
    run(N + 512 / sizeof(double));

    return 0;
}
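
To make the expected aliasing concrete, here is a minimal standalone sketch (separate from the benchmark; it assumes a 32 KiB, 8-way L1d with 64 B lines, i.e. 64 sets indexed by address bits 6-11, which all three CPUs below have) that prints which L1d set the first element of each row maps to for the three paddings:

#include <cstdio>
#include <cstdint>

double buffer[((1 << 20) + 64) * 40];  // same layout as in the benchmark above

// L1d set index of an address, assuming 64 sets of 64-byte lines (32 KiB, 8-way).
static int l1_set(const void *p) {
    return static_cast<int>((reinterpret_cast<uintptr_t>(p) >> 6) & 63);
}

int main() {
    const long base = 1L << 20;
    const long pads[] = {0, 64 / (long)sizeof(double), 512 / (long)sizeof(double)};
    for (long pad : pads) {
        long N = base + pad;
        printf("padding=%3ld B:", pad * (long)sizeof(double));
        for (long j = 0; j < 12; ++j)  // the first 12 "rows" are enough to show the pattern
            printf(" %2d", l1_set(&buffer[j * N]));
        printf("\n");
    }
    return 0;
}

With padding=0 B every row maps to the same set, with padding=64 B consecutive rows map to consecutive sets, and with padding=512 B the rows cycle through 8 distinct sets (so K rows occupy roughly K/8 lines per set).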

I run the benchmark on the following CPUs:

  • Intel Xeon E5-2680 v3 (Haswell, associativity is 8, 8, 20 for L1, L2 and L3, respectively)
  • Intel Xeon E5-2670 v3 (Haswell, associativity is 8, 8, 20)
  • Intel Xeon Gold 6150 (Skylake, associativity is 8, 16, 11)

...and I get inconsistent results and certain shapes that I don't understand.

[Plot: measured GFLOPS vs. K (1..40) on each CPU, for padding = 0 B, 64 B and 512 B]

My questions are:

1) Why did the performance not recover for padding=64 B on the E5-2680, when it did on the E5-2670? Performance recovers for padding=512 B, but why? (Is it possible that the prefetcher thrashes the cache?)

EDIT: Tested on an E5-2690 and an E5-2650 too, on another cluster. The E5-2690 behaves like the E5-2680 (bad performance for padding=64 B), while the E5-2650 behaves like the E5-2670 (good performance). So it really seems that something changed between the 2670 and the 2680.

2) Why does the performance drop so early, before K reaches 8? (I suspect it is the "bogus store forwarding stall" as explained in a somewhat different setup here, but I'm not sure.)

3) Why does the performance drop again at K=~32 for the E5-2680 and K=~35 for the Gold 6150? (perf stat -ddd shows an increased number of L3 cache misses there, but I don't see the exact reason, since K does not match the L3 associativity.)


Setup details:

The E5-2680 and the Gold 6150 both run on a cluster, where I take a full node to make sure there is no interference from other users. A node contains 2 CPUs in both cases, but I use only a single core anyway. The E5-2670 is on a single-CPU shared machine (I cannot get a full node), but I ran the benchmark while the machine was otherwise idle.

I compiled the code with g++ -O1 -std=c++11 (version 6.3.0 for the E5-2680 and the Gold, version 7.3.0 for the E5-2670). All results are reproducible. Changing the compiler (or, for example, adding -O3 or declaring buffer as volatile) more or less only changes the results for low K, e.g. K=2 gets faster/slower.

ikicic
  • Wouldn't it be easier to work with a one-dimensional buffer? I haven't read the code carefully, but isn't *N* more important than *K*? Also note that the HW prefetchers may obfuscate the results. – Margaret Bloom Oct 15 '18 at 12:06
  • In order to achieve cache thrashing, I have to access multiple (K) memory locations with a large power-of-two stride (N) between them, so logically it's a matrix. I tested now with N=2^16 and 2^18; the plots didn't change qualitatively, so I don't think N matters. K, on the other hand, does, as it is related to cache associativity. I agree that the prefetcher may obfuscate the results, but I would like to understand when and where exactly. – ikicic Oct 15 '18 at 13:20
  • You don't need a matrix, and in fact you are not using one. I also don't see why you need to access *K* consecutive 8 B memory locations, since to test associativity you only have to access all the memory locations that alias to the same set. – Margaret Bloom Oct 15 '18 at 13:54
  • True, the matrix is not important here (just that it's a buffer of size K * N). The locations are not consecutive; they differ by 8*N bytes. The innermost for-loop `j` iterates over these K locations that map to the same cache set (the index is `j * N + i` and not `i * N + j`). – ikicic Oct 15 '18 at 14:29
  • I haven't looked through your code, but are you aware that Intel's L3 cache index function isn't just a simple range of address bits like L1 and L2 use? It's a hash of more bits, to prevent aliasing with strided access patterns. See links in comments here: [According to Intel my cache should be 24-way associative though its 12-way, how is that?](https://stackoverflow.com/posts/comments/82982252). And also possibly relevant: the adaptive L3 replacement policy ([Which cache mapping technique is used in intel core i7 processor?](https://stackoverflow.com/q/49092541)). – Peter Cordes Oct 15 '18 at 14:43
  • Have you checked your transparent hugepage settings to make sure you're getting consistent results without a lot of TLB misses? Linux will use hugepages for the BSS, but unless you have `/sys/kernel/mm/transparent_hugepage/defrag` set to `always`, you might want to `madvise(MADV_HUGEPAGE)` on that array (a minimal sketch follows after these comments). You can check /proc/PID/smaps while your process is running (or started and suspended) for AnonHugePages in the relevant mapping. – Peter Cordes Oct 15 '18 at 14:49
  • Oh, so L3 does behave differently; makes sense. For large K (question 3), I do get a lot of TLB misses together with L3 misses. `/sys/...` says `[always] madvise never`. AnonHugePages is always 0 kB, even with `madvise`. With `hugectl --bss` it changes, but performance is the same. How big an impact can hugepages have here? (I'm not familiar with them.) But if L3 has a special mapping, then it can explain the drop at K=~35, no? – ikicic Oct 15 '18 at 16:13
  • First, I think it's important to use the same core frequency on all three processors so we can make a valid comparison. Second, the E5-2680 and E5-2670 have exactly the same microarchitecture, except that the 2680 has a slightly higher base and turbo frequency. Therefore, they *should* give the same results. Since this is not the case, I think something is perturbing your measurements. Can you use the same executable binary on both (instead of binaries produced by different compilers)? Can you also use the same kernel version? You really should get the same results on both. Now regarding... – Hadi Brais Oct 15 '18 at 19:02
  • ...associativity, the `padding=0B` case maps all accessed lines in an outer iteration to the same set in the L1 and L2 caches, so it has the potential for conflict misses at these levels. The other padding cases map different lines to different sets in these levels. This could explain why they perform much better. Also note that L1 and L2 use pseudo-LRU, not LRU. That said, I'm not sure if throughput would clearly demonstrate associativity. Typically latency is used instead. – Hadi Brais Oct 15 '18 at 19:02
  • That's the weird part: why do these CPUs behave differently? :/ I've now run the benchmark on a different cluster, on an E5-2650 and an E5-2690 too. The former shows no thrashing for `padding=64B`, the latter does. So it really looks like something changed between the 2670 and the 2680. Regarding the kernel, I don't have sudo. Tested with the same executable, no help. Regarding associativity, yes, I expected bad performance for `padding=0B`, but I also expect good performance for `padding=64B`. However, the E5-2680/90 are still bad for `padding=64B` -.- – ikicic Oct 15 '18 at 21:39
  • It's extremely unlikely that the E5-2680 and E5-2670 would behave differently. They were both released at the same time. These processors are pretty much identical. It's much more likely that something is perturbing your measurements on one or both systems. One thing you can do is try to perform the same experiments on other systems with the same processors. – Hadi Brais Oct 15 '18 at 22:45
  • I don't see what would perturb it; the 2680/90 are both compute nodes on supercomputing clusters, where basically nothing else is running. The 2650 is a login node, the 2670 a standalone server. Only if other HW affects it, if that makes sense. Any ideas on Q2, why does the performance drop already for K < 8 (for the Xeon Gold)? – ikicic Oct 16 '18 at 06:45
  • Did you fix the frequency to the same value on all systems before you run the experiments? This may have an impact. – Hadi Brais Oct 16 '18 at 14:06
  • Sorry for the delay. Having no sudo, I tried affecting the CPU frequency by running dummy work on extra threads (registers only, no memory). Not sure if that's enough, but it didn't change the benchmark. What did help, however, is decreasing N to 2^15. Thus, it looks like an effect of L3. But N=2^16 is slow again, even though the effective array size is 5-10 MB, much less than the 30 MB L3 capacity. Still, assuming the L3 index function is non-trivial, at this moment I'd guess it's L3 causing the drop at K=9-10 for the E5-2680. – ikicic Oct 26 '18 at 08:49
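
Following the `madvise(MADV_HUGEPAGE)` suggestion from the comments, here is a minimal sketch (assuming Linux with transparent hugepage support; the helper name `request_hugepages` is made up) of advising the kernel to back the global `buffer` with huge pages before the first `run()`:

#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>
#include <cstddef>

// Ask the kernel to back a range with transparent huge pages (Linux only).
// madvise() needs a page-aligned start address, so round down to a page boundary.
static void request_hugepages(void *ptr, std::size_t bytes) {
    uintptr_t page = static_cast<uintptr_t>(sysconf(_SC_PAGESIZE));
    uintptr_t begin = reinterpret_cast<uintptr_t>(ptr) & ~(page - 1);
    uintptr_t end = reinterpret_cast<uintptr_t>(ptr) + bytes;
    madvise(reinterpret_cast<void *>(begin), end - begin, MADV_HUGEPAGE);
}

// In main(), before the first run():
//     request_hugepages(buffer, sizeof(buffer));

Whether the mapping actually ends up with AnonHugePages > 0 can then be checked in /proc/PID/smaps, as suggested in the comments above.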
