I came across this GitHub repo, which links to this blog post about cache-aware programming. It compares the number of cache-references and cache-misses in perf output depending on the spread of data inside an array relative to the cache line size. I wanted to perform a similar experiment on my own. The code I came up with looks like this:
// depth_first.c
#include <stdint.h>
#include <stdlib.h>

#define ROWS 100000
#define COLS 64

int main() {
    uint8_t (*mat)[COLS] = malloc(sizeof(uint8_t[ROWS][COLS])); // matrix
    uint8_t sum = 0;
    for (int row_idx = 0; row_idx < ROWS; row_idx++) {
        for (int col_idx = 0; col_idx < COLS; col_idx++) {
            sum += mat[row_idx][col_idx];
        }
    }
    free(mat);
}
(I know I am not initializing the array after malloc, which is UB; however, I am not interested in the actual values and I assume it will not impact the cache performance.)
I dynamically allocate a 2D array of uint8_t called mat. Its dimensions are 100000 rows by 64 columns. As far as I understand, this means that the memory layout of the array is as follows:
[ Row 0, 64 bytes ][ Row 1, 64 bytes ][ Row 2, 64 bytes ]...[ Row 99999, 64 bytes ]
Each row occupies a contiguous area of memory. I also specifically chose 64 bytes because it is the size of a cache line on my CPU, so a row should fit exactly inside one cache line.
Now, my experiment is as follows: I iterate through the array "depth first", visiting every column inside the first row before moving to the second row, etc., as seen in the original code above. Then I modify the code to access the array "breadth first", iterating over the first element of every row, then the second element of every row, etc.:
// breadth_first.c for loop
for (int col_idx = 0; col_idx < COLS; col_idx++) { // col_idx in outer loop
    for (int row_idx = 0; row_idx < ROWS; row_idx++) { // row_idx in inner loop
        sum += mat[row_idx][col_idx];
    }
}
I compile both versions with no optimizations:
gcc -O0 breadth_first.c -o breadth_first
gcc -O0 depth_first.c -o depth_first
And test them using perf:
perf stat -e cache-references,cache-misses ./breadth_first
perf stat -e cache-references,cache-misses ./depth_first
The output I receive looks as follows (the numbers and percentages vary only slightly between runs):
Performance counter stats for './breadth_first':
12 654 452 cache-references:u
106 456 cache-misses:u # 0,841 % of all cache refs
0,015068004 seconds time elapsed
0,015102000 seconds user
0,000000000 seconds sys
Performance counter stats for './depth_first':
213 178 cache-references:u
5 901 cache-misses:u # 2,768 % of all cache refs
0,026617312 seconds time elapsed
0,026690000 seconds user
0,000000000 seconds sys
Now, what I expected to see was a similar number of cache-references in both, with a larger number/percentage of cache-misses in the breadth_first case. I expected a similar number of references in both, since both versions of the code do the same "thing" and perform the same number of accesses into the 2D array.
Instead, the number of cache-references grew immensely. The number of cache-misses also grew, but the percentage was actually better in the breadth_first case (probably just because of the much larger number of references, so the percentage alone is not indicative of anything).
Now my question is, what causes the increase in cache-references?
Update
I am testing this on an AMD CPU, in case this matters.
I located the mapping in the kernel source from the perf event id to what I assume is a hardware event id. I am now trying to navigate section 2.1.14 of the AMD PPR to find the description of the underlying hardware event. I would still appreciate an informed answer to the question.