
I am trying to understand more about how CPU cache affects performance. As a simple test I am summing the values of the first column of a matrix with varying numbers of total columns.

// compiled with: gcc -Wall -Wextra -Ofast -march=native cache.c
// tested with: for n in {1..100}; do ./a.out $n; done | tee out.csv
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

double sum_column(uint64_t ni, uint64_t nj, double const data[ni][nj])
{
    double sum = 0.0;
    for (uint64_t i = 0; i < ni; ++i) {
        sum += data[i][0];
    }
    return sum;
}

int compare(void const* _a, void const* _b)
{
    double const a = *((double*)_a);
    double const b = *((double*)_b);
    return (a > b) - (a < b);
}

int main(int argc, char** argv)
{
    // set sizes
    assert(argc == 2);
    uint64_t const iter_max = 101;
    uint64_t const ni       = 1000000;
    uint64_t const nj       = strtol(argv[1], 0, 10);

    // initialize data
    double(*data)[nj] = calloc(ni, sizeof(*data));
    for (uint64_t i = 0; i < ni; ++i) {
        for (uint64_t j = 0; j < nj; ++j) {
            data[i][j] = rand() / (double)RAND_MAX;
        }
    }

    // test performance
    double* dt        = calloc(iter_max, sizeof(*dt));
    double const sum0 = sum_column(ni, nj, data);
    for (uint64_t iter = 0; iter < iter_max; ++iter) {
        clock_t const t_start = clock();
        double const sum      = sum_column(ni, nj, data);
        clock_t const t_stop  = clock();
        assert(sum == sum0);
        dt[iter] = (t_stop - t_start) / (double)CLOCKS_PER_SEC;
    }

    // sort dt
    qsort(dt, iter_max, sizeof(*dt), compare);

    // compute mean dt
    double dt_mean = 0.0;
    for (uint64_t iter = 0; iter < iter_max; ++iter) {
        dt_mean += dt[iter];
    }
    dt_mean /= iter_max;

    // print results
    printf("%2lu %.8e %.8e %.8e %.8e\n", nj, dt[iter_max / 2], dt_mean, dt[0],
        dt[iter_max - 1]);

    // free memory
    free(data);
    free(dt);
}

However, the results are not quite what I would expect them to be:

performance plot for nj = 1..100

As far as I understand, when the CPU loads a value from data, it also places some of the following values of data in the cache. The exact number depends on the cache line size (64 bytes on my machine). This would explain why, with growing nj, the time to solution first increases linearly and then levels out at some value. If nj == 1, one load places the next 7 values in the cache and thus we only need to load from RAM every 8th value. If nj == 2, following the same logic, we need to access the RAM every 4th value. After some size, we will have to access the RAM for every value, which should result in the performance leveling out. My guess for why the linear section of the graph extends beyond nj == 4 is that in reality there are multiple levels of cache at work here, and the way values end up in these caches is a little more complex than what I described here.
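
To make this reasoning explicit, here is a small back-of-the-envelope check (my own approximation, assuming 64-byte lines, 8-byte doubles and an aligned allocation) that counts how many distinct cache lines one call to sum_column touches:

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    uint64_t const ni = 1000000;
    for (uint64_t nj = 1; nj <= 16; ++nj) {
        // count the distinct 64-byte lines that contain data[i][0] for i = 0..ni-1
        uint64_t lines = 0, last = UINT64_MAX;
        for (uint64_t i = 0; i < ni; ++i) {
            uint64_t const line = i * nj * 8 / 64;
            if (line != last) { ++lines; last = line; }
        }
        printf("nj=%2" PRIu64 ": %" PRIu64 " cache lines for %" PRIu64 " loads\n",
            nj, lines, ni);
    }
}

By this simple count, the number of touched lines stops growing once nj >= 8, which is where I would naively expect the time to level out.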

What I cannot explain is why there are these performance peaks at multiples of 16.

After thinking about this question for a bit, I decided to check whether this also occurs for higher values of nj:

performance plot for nj = 1..500

In fact, it does. And there is more: why does the performance increase again after nj ~ 250?

Could someone explain to me, or point me to some appropriate reference, why there are these peaks and why the performance increases for higher values of nj?

If you would like to try the code for yourself, I will also attach my plotting script, for your convenience:

import numpy as np
import matplotlib.pyplot as plt

# columns of out.csv: nj, median, mean, min, max (see the printf in main)
data = np.genfromtxt("out.csv")
# normalize all timings by the median time for nj = 1
data[:,1:] /= data[0,1]

# mark the nj values where the slope of the mean curve drops sharply
dy = np.diff(data[:,2]) / np.diff(data[:,0])
for i in range(len(dy) - 1):
    if dy[i] - dy[i + 1] > (dy.max() - dy.min()) / 2:
        plt.axvline(data[i + 1,0], color='gray', linestyle='--')
        plt.text(data[i + 1,0], 1.5 * data[0,3], f"{int(data[i + 1,0])}",
                 rotation=0, ha="center", va="center",
                 bbox=dict(boxstyle="round", ec='gray', fc='w'))

# shaded band between the minimum and maximum timings
plt.fill_between(data[:,0], data[:,3], data[:,4], color='gray')
plt.plot(data[:,0], data[:,1], label="median")
plt.plot(data[:,0], data[:,2], label="mean")
plt.legend(loc="upper left")
plt.xlabel("nj")
plt.ylabel("dt / dt$_0$")
plt.savefig("out.pdf")
koipond
    "What I cannot explain is why there are these performance peaks at multiples of 16." -- My guess is that you are encountering the same issue as in this question: [Why is my program slow when looping over exactly 8192 elements?](https://stackoverflow.com/q/12264970/12149471) – Andreas Wenzel Apr 10 '22 at 16:13
  • 2
    Re “As far as I understand, when the CPU loads a value from `data`, it also places some of the following values of `data` in the cache.” That is not correct. When there is a load with caching enabled, the CPU loads the cache block the data is in (or the two cache blocks if it spans a boundary). If the data is at the start of the block, then, yes, the other data loaded is the following data. If the data is in the middle or end of the block, then data before it will be loaded too or instead. – Eric Postpischil Apr 10 '22 at 16:43
  • 1
    Re “What I cannot explain is why there are these performance peaks at multiples of 16.”: Cache features have a “geometry”. Among this is that cache lines can only be placed in certain cache sets. Your processor may have, for example, 64 sets, each of which can hold 4 cache lines of 64 bytes each. But for a given cache line, certain bits of its memory address determine which set it must go into. When `nj` is a multiple of 16, your loads are skipping through memory in a way that skips some of those cache sets. In consequence, you only get to use maybe half of cache. – Eric Postpischil Apr 10 '22 at 16:49
  • 1
    With values of `nj` that are not multiples of 16, the loads progress through different cache sets. I have not analyzed your code to see is this particular effect is the cause. – Eric Postpischil Apr 10 '22 at 16:50
  • 2
    "As far as I understand, when the CPU loads a value from data, it also places some of the following values of data in the cache." -- You are probably referring to two separate things: 1. When accessing a `double`, instead of only loading 8 bytes into the cache, it will load an entire [cache line](https://en.wikipedia.org/wiki/CPU_cache#CACHE-LINES) of typically 64 bytes into the cache, which may include data before and after the `double`. 2. If the CPU detects a sequential access pattern, it will [prefetch](https://en.wikipedia.org/wiki/Cache_prefetching) data according to the prediction. – Andreas Wenzel Apr 10 '22 at 17:52

1 Answer

The plots show the combination of several complex low-level effects (mainly cache thrashing and prefetching issues). I assume the target platform is a mainstream modern processor with cache lines of 64 bytes (typically an x86 one).

I can reproduce the problem on my i5-9600KF processor. Here is the resulting plot:

performance plot


First of all, when nj is small, the gap between fetched addresses (ie. the stride) is small and cache lines are used relatively efficiently. For example, when nj = 1, the access is contiguous. In this case, the processor can efficiently prefetch the cache lines from the DRAM so as to hide its high latency. There is also good spatial cache locality since many contiguous items share the same cache line.

When nj = 2, only half the values of a cache line are used. This means the number of requested cache lines is twice as big for the same number of operations. That being said, the time is not much bigger due to the relatively high latency of adding two floating-point numbers, resulting in a compute-bound code. You can unroll the loop 4 times and use 4 different sum variables so that (mainstream modern) processors can add multiple values in parallel (a minimal sketch is shown below). Note that most processors can also load multiple values from the cache per cycle.

When nj = 4, a new cache line is requested every 2 cycles (since a double takes 8 bytes). As a result, the memory throughput can become so big that the computation becomes memory-bound. One may expect the time to be stable for nj >= 8 since the number of requested cache lines should be the same, but in practice processors prefetch multiple contiguous cache lines so as not to pay the huge overhead of the DRAM latency in this case. The number of prefetched cache lines is generally between 2 and 4 (AFAIK such a prefetching strategy is disabled on Intel processors when the stride is bigger than 512 bytes, ie. when nj >= 64). This explains why the timings increase sharply while nj < 32 and become relatively stable for 32 <= nj <= 256, with the exception of the peaks.
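
For example, here is a minimal sketch of such an unrolling (a hypothetical sum_column_unrolled helper; it assumes ni is a multiple of 4, which holds for ni = 1000000, and note that reassociating the additions can slightly change the rounded result):

double sum_column_unrolled(uint64_t ni, uint64_t nj, double const data[ni][nj])
{
    // 4 independent accumulators break the dependency chain on a single sum,
    // so additions from different iterations can overlap in the pipeline.
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (uint64_t i = 0; i < ni; i += 4) {
        s0 += data[i + 0][0];
        s1 += data[i + 1][0];
        s2 += data[i + 2][0];
        s3 += data[i + 3][0];
    }
    return s0 + s1 + s2 + s3;
}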

The regular peaks happening when nj is a multiple of 16 are due to a complex cache effect called cache thrashing. Modern caches are N-way associative, with N typically between 4 and 16. For example, here are the statistics for my i5-9600KF processor:

Cache 0: L1 data cache,        line size 64,  8-ways,    64 sets, size 32k 
Cache 1: L1 instruction cache, line size 64,  8-ways,    64 sets, size 32k 
Cache 2: L2 unified cache,     line size 64,  4-ways,  1024 sets, size 256k 
Cache 3: L3 unified cache,     line size 64, 12-ways, 12288 sets, size 9216k 
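
As a side note, this geometry can also be queried programmatically on Linux with glibc. Here is a small sketch (these sysconf names are glibc extensions and may report 0 when the information is unavailable):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    // glibc-specific queries; each returns 0 or -1 if the value is unknown
    printf("L1d size:  %ld\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L1d ways:  %ld\n", sysconf(_SC_LEVEL1_DCACHE_ASSOC));
    printf("L1d line:  %ld\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    printf("L2 size:   %ld\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3 size:   %ld\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
}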

This means that two values fetched from the DRAM with respective addresses A1 and A2 can result in conflicts in my L1 cache if (A1 % 32768) / 64 == (A2 % 32768) / 64. In this case, the processor needs to choose which cache line to replace within a set of N=8 cache lines. There are many cache replacement policies and none is perfect. Thus, some useful cache lines are sometimes evicted too early, resulting in additional cache misses later. In pathological cases, many DRAM locations can compete for the same cache lines, resulting in excessive cache misses. For instance, with nj = 512 the row stride is 4096 bytes, so data[i][0] and data[i+8][0] are exactly 32 KiB apart and compete for the same set of 8 lines. More information about this can also be found in this post.

Regarding the nj stride, the number of cache lines that can be effectively used in the L1 cache is limited. For example, if all fetched values have the same address modulo the cache size, then only N cache lines (ie. 8 for my processor) can actually be used to store all the values. Having fewer cache lines available is a big problem, since the prefetcher needs a pretty large space in the cache to store the many cache lines needed later. The smaller the number of concurrent fetches, the lower the memory throughput. This is especially true here since the latency of fetching 1 cache line from the DRAM is several dozens of nanoseconds (eg. ~70 ns) while the bandwidth is dozens of GiB/s (eg. ~40 GiB/s): dozens of cache lines (eg. ~40) should be fetched concurrently so as to hide the latency and saturate the DRAM (70 ns x 40 GiB/s is about 3 KiB of data in flight, ie. a few dozen 64-byte cache lines).

Here is a simulation of the number of cache lines that can actually be used in my L1 cache for each value of nj (a sketch of such a simulation is given below, after the table):

 nj  #cache-lines
  1      512
  2      512
  3      512
  4      512
  5      512
  6      512
  7      512
  8      512
  9      512
 10      512
 11      512
 12      512
 13      512
 14      512
 15      512
 16      256    <----
 17      512
 18      512
 19      512
 20      512
 21      512
 22      512
 23      512
 24      512
 25      512
 26      512
 27      512
 28      512
 29      512
 30      512
 31      512
 32      128    <----
 33      512
 34      512
 35      512
 36      512
 37      512
 38      512
 39      512
 40      512
 41      512
 42      512
 43      512
 44      512
 45      512
 46      512
 47      512
 48      256    <----
 49      512
 50      512
 51      512
 52      512
 53      512
 54      512
 55      512
 56      512
 57      512
 58      512
 59      512
 60      512
 61      512
 62      512
 63      512
 64       64    <----
==============
 80      256
 96      128
112      256
128       32
144      256
160      128
176      256
192       64
208      256
224      128
240      256
256       16
384       32
512        8
1024       4

We can see that the number of available cache lines is smaller when nj is a multiple of 16. In this case, the prefetcher will preload data into cache lines that are likely to be evicted early by subsequent fetches (done concurrently). Load instructions performed in the code are more likely to result in cache misses when the number of available cache lines is small. When a cache miss happens, the value then needs to be fetched again from the L2, or even the L3, resulting in a slower execution. Note that the L2 cache is also subject to the same effect, though it is less visible since it is larger. The L3 cache of modern x86 processors makes use of hashing to better distribute things and reduce collisions from fixed strides (at least on Intel processors, and certainly on AMD too, though AFAIK this is not documented).
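
For reference, here is a minimal sketch of how such a simulation can be done. It is an approximation that only models the L1 set indexing of an aligned array (set = (byte offset / 64) mod 64, with 8 usable ways per touched set), so the counts may differ slightly from the table above for some very large strides:

#include <inttypes.h>
#include <stdbool.h>
#include <stdio.h>

// L1d geometry of the i5-9600KF: 64 sets of 8 ways, 64-byte lines
#define SETS 64
#define WAYS 8

// Count how many L1 cache lines the column accesses can actually use:
// the accesses only ever touch some of the sets, and each touched set
// can hold at most WAYS lines.
static uint64_t usable_lines(uint64_t nj)
{
    bool touched[SETS] = {false};
    uint64_t used_sets = 0;
    for (uint64_t i = 0; i < 4096; ++i) {  // enough iterations to visit every reachable set
        uint64_t const set = (i * nj * 8 / 64) % SETS;
        if (!touched[set]) { touched[set] = true; ++used_sets; }
    }
    return used_sets * WAYS;
}

int main(void)
{
    for (uint64_t nj = 1; nj <= 64; ++nj)
        printf("%3" PRIu64 " %8" PRIu64 "\n", nj, usable_lines(nj));
}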

Here are the timings on my machine for some of the peaks (columns: nj, median, mean, min, max):

  32 4.63600000e-03 4.62298020e-03 4.06400000e-03 4.97300000e-03
  48 4.95800000e-03 4.96994059e-03 4.60400000e-03 5.59800000e-03
  64 5.01600000e-03 5.00479208e-03 4.26900000e-03 5.33100000e-03
  96 4.99300000e-03 5.02284158e-03 4.94700000e-03 5.29700000e-03
 128 5.23300000e-03 5.26405941e-03 4.93200000e-03 5.85100000e-03
 192 4.76900000e-03 4.78833663e-03 4.60100000e-03 5.01600000e-03
 256 5.78500000e-03 5.81666337e-03 5.77600000e-03 6.35300000e-03
 384 5.25900000e-03 5.32504950e-03 5.22800000e-03 6.75800000e-03
 512 5.02700000e-03 5.05165347e-03 5.02100000e-03 5.34400000e-03
1024 5.29200000e-03 5.33059406e-03 5.28700000e-03 5.65700000e-03

As expected, the timings are overall bigger in practice in the cases where the number of available cache lines is much smaller. However, when nj >= 512, the results are surprising since they are significantly faster than the others. This is the case where the number of available cache lines is equal to the number of ways of associativity (N). My guess is that Intel processors detect this pathological case and optimize the prefetching so as to reduce the number of cache misses (using line-fill buffers to bypass the L1 cache -- see below).

Finally, for large nj strides, a bigger nj should result in higher overheads, mainly due to the translation lookaside buffer (TLB): there are more page addresses to translate with a bigger nj and the number of TLB entries is limited. For example, with nj >= 512 the row stride is at least 4 KiB, so every access touches a different page and the ni accesses touch far more pages than the TLB can hold. In fact, this is what I can observe on my machine: timings tend to increase slowly and in a very stable way, unlike on your target platform.

I cannot really explain this very strange behavior yet. Here are some wild guesses:

  • The OS could tend to use more huge pages when nj is large (so as to reduce the overhead of the TLB) since wider blocks are allocated. This could result in more concurrency for the prefetcher since, AFAIK, it cannot cross page boundaries. You can try to check the number of allocated (transparent) huge pages (by looking at AnonHugePages in /proc/meminfo on Linux), force them to be used in this case (using an explicit mmap, see the sketch after this list), or possibly disable them. My system appears to make use of 2 MiB transparent huge pages independently of the nj value.
  • If the target architecture is a NUMA one (eg. recent AMD processors, or a server with multiple processors each having their own memory), then the OS could allocate pages physically stored on another NUMA node because there is less space available on the current NUMA node. This could result in higher performance due to the bigger throughput (though the latency is higher). You can control this policy with numactl on Linux so as to force local allocations.
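
Regarding the first point, here is a minimal sketch (my own illustration with a hypothetical alloc_thp helper; it is Linux-specific and best-effort, since madvise only hints the kernel and transparent huge pages must be enabled system-wide) of how the data array could be allocated so that it is more likely to be backed by 2 MiB transparent huge pages:

#include <stddef.h>
#include <sys/mman.h>

// Allocate `size` zero-initialized bytes with mmap and hint the kernel to back
// the mapping with transparent huge pages (the hint is ignored if THP is off).
static void* alloc_thp(size_t size)
{
    void* ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (ptr == MAP_FAILED)
        return NULL;
    madvise(ptr, size, MADV_HUGEPAGE);
    return ptr;
}

// In the benchmark, `double(*data)[nj] = calloc(ni, sizeof(*data));` would
// become `double(*data)[nj] = alloc_thp(ni * sizeof(*data));`, and the memory
// would be released with munmap(data, ni * sizeof(*data)) instead of free(data).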

For more information about this topic, please read the great document What Every Programmer Should Know About Memory. Moreover, a very good post about how x86 caches work in practice is available here.


Removing the peaks

To remove the peaks due to cache thrashing on x86 processors, you can use non-temporal software prefetching instructions, so that cache lines are fetched into a non-temporal cache structure close to the processor, which should not cause thrashing in the L1 (if possible). Such a cache structure is typically the line-fill buffers (LFB) on Intel processors and the (equivalent) miss address buffers (MAB) on AMD Zen processors. For more information about non-temporal instructions and the LFB, please read this post and this one. Here is the modified code, which also includes a loop unrolling optimization to speed up the code when nj is small:

#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>  // _mm_prefetch / _MM_HINT_NTA (x86 only)

// The matrix is now passed as a flat pointer, so the caller passes &data[0][0];
// the unrolled branch assumes ni is a multiple of 4 (true for ni = 1000000).
double sum_column(uint64_t ni, uint64_t nj, double* const data)
{
    double sum0 = 0.0;
    double sum1 = 0.0;
    double sum2 = 0.0;
    double sum3 = 0.0;

    if(nj % 16 == 0)
    {
        // Cache-bypassing prefetch to avoid cache thrashing.
        // Prefetching slightly past the end of the array is harmless since
        // prefetch instructions never fault.
        const size_t distance = 12;
        for (uint64_t i = 0; i < ni; ++i) {
            _mm_prefetch((const char*)&data[(i+distance)*nj+0], _MM_HINT_NTA);
            sum0 += data[i*nj+0];
        }
    }
    else
    {
        // Unrolling is much better for small strides
        for (uint64_t i = 0; i < ni; i+=4) {
            sum0 += data[(i+0)*nj+0];
            sum1 += data[(i+1)*nj+0];
            sum2 += data[(i+2)*nj+0];
            sum3 += data[(i+3)*nj+0];
        }
    }

    return sum0 + sum1 + sum2 + sum3;
}

Here is the result of the modified code:

performance plot of the optimized code

We can see that the peaks no longer appear in the timings. We can also see that the normalized values are much bigger, because dt0 is about 4 times smaller (due to the loop unrolling).

Note that cache thrashing in the L2 cache is not avoided by this method in practice (at least on Intel processors). This means that the effect is still there for huge nj strides that are multiples of 512 (ie. a 4 KiB row stride) on my machine (it is actually slower than before, especially when nj >= 2048). It may be a good idea to stop the prefetching when (nj%512) == 0 && nj >= 512 on x86 processors. Apart from that, AFAIK, there is no way to address this problem. That being said, it is a very bad idea to perform such big strided accesses on very large data structures anyway.

Note that distance should be carefully chosen, since prefetching too early can result in cache lines being evicted before they are actually used (so they need to be fetched again) and prefetching too late is not very useful. I think using a value close to the number of entries in the LFB/MAB is a good idea (eg. 12 on Skylake/KabyLake/CannonLake, 22 on Zen-2).

Jérôme Richard
  • Wow! Thanks for the detailed answer and for linking these resources! You were right about my processor: an Intel i7-1160G7 on a ThinkPad X1 Nano running Linux. The loop unrolling is definitely a neat trick. I noticed that clang seems to do this by default, and you can instruct gcc to do so with `#pragma GCC unroll n`; then the results are the same as yours, except for the performance increase around `nj == 250`. Maybe this is linked to the newer hardware? – koipond Apr 19 '22 at 16:25
  • For the loop unrolling it is more complicated: compilers can unroll the loop, but not efficiently by default. The reason is that the IEEE-754 norm prevents them from reordering floating-point additions. As a result, there will still be a long chain of dependent instructions that the processor cannot execute efficiently. To fix that, you need to at least enable `-ffast-math` (in fact `-fassociative-math` should be enough in theory, but not in practice in many cases). Moreover, the unrolling of GCC tends not to be very clever (it often keeps the conditional branches of the loop). – Jérôme Richard Apr 19 '22 at 21:36
  • For the performance increase, your architecture is quite different: the processor (Tiger Lake, which is quite new) has much larger caches than usual ones and does not have power-of-two sizes nor a power-of-two number of ways. AFAIK Tiger Lake cores do more advanced optimisations on memory accesses, and it seems this processor uses LPDDR memory, which is a bit different from usual DDR RAM. I am not sure this can be the cause of such a strange effect. The proposed NUMA hypothesis does not apply in your case. The one about huge pages still applies. – Jérôme Richard Apr 19 '22 at 21:59
  • Note that depending on how your DRAM is configured on your motherboard and on the kind of RAM, the integrated memory controller of your processor may or may not be able to perform concurrent accesses on the DRAM, depending on the distribution of the allocated pages on the DRAM (if they are not interleaved). The result is a very variable throughput depending on the amount of allocated memory. I saw such a weird effect on my older PC with 2 DRAM modules of 2+4 GiB. – Jérôme Richard Apr 19 '22 at 22:08
  • Anyway, the effect certainly comes from one of the following: hardware prefetching, OS paging, DRAM. You can check the OS paging, though it is not easy. If it is not that, then the most probable reason is the DRAM hardware. – Jérôme Richard Apr 19 '22 at 22:13
  • How does that scale with multi-core? I think each core has its own L1/L2 cache but they share the L3? – Matthieu Apr 21 '22 at 14:25
  • @Matthieu This depends on the processor architecture. Most (recent) Intel processors have a shared L3. Some have a reduced L3 only used for atomic operations and shared cached data. On AMD Zen processors, AFAIK cores can only access a limited part of the L3: the L3 slice of their CCX/CCD (a group of cores on the same processor belonging to the same NUMA node). Fetching cache lines of 64 bytes while using only 8 bytes tends to saturate the DRAM (or the L3 when the working array is small enough to fit in the cache). 4 cores fetching 64 bytes at 200 MHz is enough to saturate a PC DRAM bandwidth. – Jérôme Richard Apr 22 '22 at 11:31
  • @Matthieu Having a cache thrashing issue in the L1 should not be much of a problem when multiple cores are used (in which case the code is likely bound by the DRAM throughput). In the L2 this is more problematic since it puts more pressure on the L3, but the L3 is designed to (mostly) scale with the number of cores, so the effect should be limited (especially on Zen processors). This is certainly less true on the Intel processors with a reduced L3. Indeed, useful cache lines could be evicted by fetches of some cores, so others may need to reload them later (depending on the executed code). – Jérôme Richard Apr 22 '22 at 11:40
  • @Matthieu Put shortly, I expect the *DRAM saturation* to result in *poor scalability* in this case (since the data does not fit in the cache). Indeed, a few cores should be enough to saturate it on most platforms (especially with the modified code). In fact, this is often already the case with contiguous accesses for a similar code (if vectorized). – Jérôme Richard Apr 22 '22 at 11:47