Inspired by this recent question on SO and the answers given, which made me feel very ignorant, I decided I'd spend some time to learn more about CPU caching and wrote a small program to verify whether I am getting this whole thing right (most likely not, I'm afraid). I'll first write down the assumptions that underlie my expectations, so you could possibly stop me here if those are wrong. Based on what I've read, in general:
- An `n`-way associative cache is divided into `s` sets, each containing `n` lines, each line having a fixed size `L`;
- Each main memory address `A` can be mapped into any of the `n` cache lines of one set;
- The set into which address `A` is mapped can be found by splitting the address space into slots, each the size of one cache line, then computing the index of `A`'s slot (`I = A / L`), and finally performing a modulo operation to map the index onto the target set `T` (`T = I % s`);
- A cache read miss incurs a higher delay than a cache write miss, because on a write the CPU is less likely to stall and stay idle while waiting for the main memory line to be fetched.
My first question is: are these assumptions correct?
Assuming they are, I tried to play a bit with these concepts so I could actually see them having a concrete impact on a program. I wrote a simple test that allocates a memory buffer of `B` bytes and repeatedly accesses locations of that buffer with fixed increments of a given step from the beginning of the buffer (meaning that if `B` is 14 and the step is 3, I repeatedly visit only locations 0, 3, 6, 9, and 12 - and the same is true if `B` is 13, 14, or 15):
int index = 0;
for (int i = 0; i < REPS; i++)
{
index += STEP;
if (index >= B) { index = 0; }
buffer[index] = ...; // Do something here!
}
Given the above assumptions, my expectations were that:
- When setting `STEP` equal to the critical stride (i.e. the size of a cache line times the number of sets in the cache, or `L * s`), performance should be significantly worse than when `STEP` is set to, for instance, `(L * s) + 1`, because we would be accessing only memory locations that get mapped into the same set, forcing a cache line to be evicted more frequently from that set and resulting in a higher rate of cache misses;
- When `STEP` is equal to the critical stride, performance should not be affected by the size `B` of the buffer, as long as it is not too small (otherwise too few locations would be visited and there would be fewer cache misses); otherwise, performance should be affected by `B`, because with a bigger buffer we are more likely to access locations that get mapped into different sets (especially if `STEP` is not a multiple of 2);
- The performance loss should be worse when reading from and writing to each buffer location than when only writing to those locations: writing to a memory location should not require waiting for the corresponding line to be fetched, so accessing memory locations that map into the same set (again, by using the critical stride as `STEP`) should have only a minor impact.
So I used RightMark Memory Analyzer to find out the parameters of my L1 CPU data cache, tuned up the sizes in my program, and tried it out. This is how I wrote the main cycle (`onlyWriteToCache` is a flag that can be set from the command line):
...
for (int i = 0; i < REPS; i++)
{
...
if (onlyWriteToCache)
{
buffer[index] = (char)(index % 255);
}
else
{
buffer[index] = (char)(buffer[index] % 255);
}
}
The outcome in short:
- Expectations 1) and 2) were confirmed;
- Expectation 3) was not confirmed.
This result strikes me, and makes me think there is something I did not get quite right. When `B` is 256 MB and `STEP` is equal to the critical stride, the test (compiled with -O3 on GCC 4.7.1) shows that:
- The write-only version of the cycle suffers from an average ~6x performance loss (6.234s vs 1.078s);
- The read-write version of the cycle suffers from an average ~1.3x performance loss (6.671s vs 5.25s).
So my second question is: why this difference? I would have expected the performance loss to be higher when reading and writing than when only writing.
For the sake of completeness, below is the program I wrote for doing the tests, where the constants reflect the hardware parameters of my machine: my L1 8-way associative data cache is 32 KB in size and the size `L` of each cache line is 64 bytes, which gives a total of 64 sets (the CPU has a separate L1 8-way instruction cache of the same size and with identical line size).
#include <iostream>
#include <ctime>
#include <cstdlib>
#include <iterator>
#include <algorithm>
using namespace std;
// Auxiliary functions
constexpr int pow(int base, int exp)
{
return ((exp == 0) ? 1 : base * pow(base, exp - 1));
}
int main(int argc, char* argv[])
{
//======================================================================
// Define behavior from command-line arguments
//======================================================================
bool useCriticalStep = false;
bool onlyWriteToCache = true;
size_t BUFFER_SIZE = pow(2, 28);
size_t REPS = pow(2, 27);
if (argc > 1)
{
for (int i = 1; i < argc; i++)
{
string option = argv[i];
if (option == "-c")
{
useCriticalStep = true;
}
else if (option == "-r")
{
onlyWriteToCache = false;
}
else if (option[1] == 's')
{
string encodedSizeInMB = option.substr(2);
size_t sizeInMB = atoi(encodedSizeInMB.c_str());
BUFFER_SIZE = sizeInMB * pow(2, 20);
}
else if (option[1] == 'f')
{
string encodedNumOfReps = option.substr(2);
size_t millionsOfReps = atoi(encodedNumOfReps.c_str());
REPS = millionsOfReps * pow(10, 6);
}
}
}
//======================================================================
// Machine parameters
//======================================================================
constexpr int CACHE_SIZE = pow(2, 15);
constexpr int CACHE_LINE_SIZE = 64;
constexpr int CACHE_LINES_PER_SET = 8;
constexpr int SET_SIZE = CACHE_LINE_SIZE * CACHE_LINES_PER_SET;
constexpr int NUM_OF_SETS = CACHE_SIZE / SET_SIZE;
//======================================================================
// Print out the machine parameters
//======================================================================
cout << "CACHE SIZE: " << CACHE_SIZE / 1024 << " KB" << endl;
cout << "CACHE LINE SIZE: " << CACHE_LINE_SIZE << " bytes" << endl;
cout << "CACHE LINES PER SET: " << CACHE_LINES_PER_SET << endl;
cout << "SET SIZE: " << SET_SIZE << " bytes" << endl;
cout << "NUMBER OF SETS: " << NUM_OF_SETS << endl;
fill_n(ostream_iterator<char>(cout), 30, '='); cout << endl;
//======================================================================
// Test parameters
//======================================================================
const int STEP = NUM_OF_SETS * CACHE_LINE_SIZE + (useCriticalStep ? 0 : 1);
//======================================================================
// Print out the test parameters
//======================================================================
cout << "BUFFER SIZE: " << BUFFER_SIZE / pow(2, 20) << " MB" << endl;
cout << "STEP SIZE: " << STEP << " bytes" << endl;
cout << "NUMBER OF REPS: " << REPS << endl;
fill_n(ostream_iterator<char>(cout), 30, '='); cout << endl;
//======================================================================
// Start the test
//======================================================================
char* buffer = new char[BUFFER_SIZE];
clock_t t1 = clock();
size_t index = 0; // size_t avoids a signed/unsigned mismatch with BUFFER_SIZE
for (size_t i = 0; i < REPS; i++)
{
index += STEP;
if (index >= BUFFER_SIZE)
{
index = 0;
}
if (onlyWriteToCache)
{
buffer[index] = (char)(index % 255);
}
else
{
buffer[index] = (char)(buffer[index] % 255);
}
}
clock_t t2 = clock();
//======================================================================
// Print the execution time (in clock ticks) and cleanup resources
//======================================================================
float executionTime = (float)(t2 - t1) / CLOCKS_PER_SEC;
cout << "EXECUTION TIME: " << executionTime << "s" << endl;
delete[] buffer;
}
Thank you in advance if you managed to read through this long question.