Inspired by this recent question on SO and the answers given, which made me feel very ignorant, I decided I'd spend some time to learn more about CPU caching and wrote a small program to verify whether I am getting this whole thing right (most likely not, I'm afraid). I'll first write down the assumptions that underlie my expectations, so you could possibly stop me here if those are wrong. Based on what I've read, in general:
- An `n`-way associative cache is divided into `s` sets, each containing `n` lines, each line having a fixed size `L`;
- Each main memory address `A` can be mapped into any of the `n` cache lines of one set;
- The set into which address `A` is mapped can be found by splitting the address space into slots, each the size of one cache line, then computing the index of `A`'s slot (`I = A / L`), and finally performing a modulo operation to map the index onto the target set `T` (`T = I % s`);
- A cache read miss incurs a higher delay than a cache write miss, because on a write the CPU is less likely to stall and stay idle while waiting for the main memory line to be fetched.
My first question is: are these assumptions correct?
Assuming they are, I tried to play a bit with these concepts so I could actually see them having a concrete impact on a program. I wrote a simple test that allocates a memory buffer of `B` bytes and repeatedly accesses locations of that buffer with fixed increments of a given step from the beginning of the buffer (meaning that if `B` is 14 and the step is 3, I repeatedly visit only locations 0, 3, 6, 9, and 12 - and the same is true if `B` is 13, 14, or 15):
int index = 0;
for (int i = 0; i < REPS; i++)
{
index += STEP;
if (index >= B) { index = 0; }
buffer[index] = ...; // Do something here!
}
Given the above assumptions, my expectations were that:
- When setting `STEP` equal to the critical stride (i.e. the size of a cache line times the number of sets in the cache, or `L * s`), performance should be significantly worse than when `STEP` is set to, for instance, `(L * s) + 1`, because we would be accessing only memory locations that get mapped into the same set, forcing a cache line to be evicted more frequently from that set and resulting in a higher rate of cache misses;
- When `STEP` is equal to the critical stride, performance should not be affected by the size `B` of the buffer, as long as it is not too small (otherwise too few locations would be visited and there would be fewer cache misses); otherwise, performance should be affected by `B`, because with a bigger buffer we are more likely to access locations that get mapped into different sets (especially if `STEP` is not a multiple of 2);
- The performance loss should be worse when reading from and writing to each buffer location than when only writing to those locations: writing to a memory location should not require waiting for the corresponding line to be fetched, so accessing memory locations that map into the same set (again, by using the critical stride as `STEP`) should have only a minor impact.
So I used RightMark Memory Analyzer to find out the parameters of my L1 CPU data cache, tuned up the sizes in my program, and tried it out. This is how I wrote the main cycle (`onlyWriteToCache` is a flag that can be set from the command line):
...
for (int i = 0; i < REPS; i++)
{
...
if (onlyWriteToCache)
{
buffer[index] = (char)(index % 255);
}
else
{
buffer[index] = (char)(buffer[index] % 255);
}
}
The outcome in short:
- Expectations 1) and 2) were confirmed;
- Expectation 3) was not confirmed.
This result strikes me, and makes me think there is something I did not get quite right. When `B` is 256 MB and `STEP` is equal to the critical stride, the test (compiled with -O3 on GCC 4.7.1) shows that:
- The write-only version of the cycle suffers from an average ~6x performance loss (6.234s vs 1.078s);
- The read-write version of the cycle suffers from an average ~1.3x performance loss (6.671s vs 5.25s).
So my second question is: why this difference? I would have expected the performance loss to be higher when reading and writing than when only writing.
For the sake of completeness, below is the program I wrote for doing the tests, where the constants reflect the hardware parameters of my machine: my L1 8-way associative data cache is 32 KB in size and the size `L` of each cache line is 64 bytes, which gives a total of 64 sets (the CPU has a separate L1 8-way instruction cache of the same size and with identical line size).
#include <iostream>
#include <ctime>
#include <cstdlib>
#include <iterator>
#include <algorithm>
using namespace std;
// Auxiliary functions
constexpr int pow(int base, int exp)
{
return ((exp == 0) ? 1 : base * pow(base, exp - 1));
}
int main(int argc, char* argv[])
{
//======================================================================
// Define behavior from command-line arguments
//======================================================================
bool useCriticalStep = false;
bool onlyWriteToCache = true;
size_t BUFFER_SIZE = pow(2, 28);
size_t REPS = pow(2, 27);
if (argc > 1)
{
for (int i = 1; i < argc; i++)
{
string option = argv[i];
if (option == "-c")
{
useCriticalStep = true;
}
else if (option == "-r")
{
onlyWriteToCache = false;
}
else if (option[1] == 's')
{
string encodedSizeInMB = option.substr(2);
size_t sizeInMB = atoi(encodedSizeInMB.c_str());
BUFFER_SIZE = sizeInMB * pow(2, 20);
}
else if (option[1] == 'f')
{
string encodedNumOfReps = option.substr(2);
size_t millionsOfReps = atoi(encodedNumOfReps.c_str());
REPS = millionsOfReps * pow(10, 6);
}
}
}
//======================================================================
// Machine parameters
//======================================================================
constexpr int CACHE_SIZE = pow(2, 15);
constexpr int CACHE_LINE_SIZE = 64;
constexpr int CACHE_LINES_PER_SET = 8;
constexpr int SET_SIZE = CACHE_LINE_SIZE * CACHE_LINES_PER_SET;
constexpr int NUM_OF_SETS = CACHE_SIZE / SET_SIZE;
//======================================================================
// Print out the machine parameters
//======================================================================
cout << "CACHE SIZE: " << CACHE_SIZE / 1024 << " KB" << endl;
cout << "CACHE LINE SIZE: " << CACHE_LINE_SIZE << " bytes" << endl;
cout << "CACHE LINES PER SET: " << CACHE_LINES_PER_SET << endl;
cout << "SET SIZE: " << SET_SIZE << " bytes" << endl;
cout << "NUMBER OF SETS: " << NUM_OF_SETS << endl;
fill_n(ostream_iterator<char>(cout), 30, '='); cout << endl;
//======================================================================
// Test parameters
//======================================================================
const int STEP = NUM_OF_SETS * CACHE_LINE_SIZE + (useCriticalStep ? 0 : 1);
//======================================================================
// Print out the test parameters
//======================================================================
cout << "BUFFER SIZE: " << BUFFER_SIZE / pow(2, 20) << " MB" << endl;
cout << "STEP SIZE: " << STEP << " bytes" << endl;
cout << "NUMBER OF REPS: " << REPS << endl;
fill_n(ostream_iterator<char>(cout), 30, '='); cout << endl;
//======================================================================
// Start the test
//======================================================================
char* buffer = new char[BUFFER_SIZE];
clock_t t1 = clock();
size_t index = 0; // size_t avoids a signed/unsigned mismatch with BUFFER_SIZE
for (size_t i = 0; i < REPS; i++)
{
index += STEP;
if (index >= BUFFER_SIZE)
{
index = 0;
}
if (onlyWriteToCache)
{
buffer[index] = (char)(index % 255);
}
else
{
buffer[index] = (char)(buffer[index] % 255);
}
}
clock_t t2 = clock();
//======================================================================
// Print the execution time (in clock ticks) and cleanup resources
//======================================================================
float executionTime = (float)(t2 - t1) / CLOCKS_PER_SEC;
cout << "EXECUTION TIME: " << executionTime << "s" << endl;
delete[] buffer;
}
Thank you in advance if you managed to read through this long question.