I came across this GitHub repo, which links to this blog post about cache-aware programming. It compares the number of cache-references and cache-misses in perf output depending on the spread of data inside an array relative to the cache line size. I wanted to perform a similar experiment on my own. The code I came up with looks like this:
// depth_first.c
#include <stdint.h>
#include <stdlib.h>

#define ROWS 100000
#define COLS 64

int main() {
    uint8_t (*mat)[COLS] = malloc(sizeof(uint8_t[ROWS][COLS])); // matrix
    uint8_t sum = 0;
    for (int row_idx = 0; row_idx < ROWS; row_idx++) {
        for (int col_idx = 0; col_idx < COLS; col_idx++) {
            sum += mat[row_idx][col_idx];
        }
    }
    free(mat);
}
(I know I am not initializing the array after malloc, which is UB; however, I am not interested in the actual values and I assume it will not impact the cache performance.)
I dynamically allocate a 2D array of uint8_t called mat. Its dimensions are 100000 rows by 64 columns. As far as I understand, this means that the memory layout of the array is as follows:
[ Row 0, 64 bytes ][ Row 1, 64 bytes ][ Row 2, 64 bytes ]...[ Row 99999, 64 bytes ]
Each row occupies a contiguous area of memory. I also specifically chose 64 bytes because it is the size of a cache line on my CPU, so a row should fit exactly inside one cache line.
Now, my experiment is as follows: I iterate through the array "depth first", visiting every column inside the first row before moving to the second row, etc., as seen in the original code above. Then I modify the code to access the array "breadth first", iterating over the first element of every row, then the second element of every row, etc.:
// breadth_first.c for loop
for (int col_idx = 0; col_idx < COLS; col_idx++) { // col_idx in outer loop
    for (int row_idx = 0; row_idx < ROWS; row_idx++) { // row_idx in inner loop
        sum += mat[row_idx][col_idx];
    }
}
I compile both versions with no optimizations:
gcc -O0 breadth_first.c -o breadth_first
gcc -O0 depth_first.c -o depth_first
And test them using perf:
perf stat -e cache-references,cache-misses ./breadth_first
perf stat -e cache-references,cache-misses ./depth_first
The output I receive looks as follows (the numbers and percentages vary only slightly between runs):
Performance counter stats for './breadth_first':
12 654 452 cache-references:u
106 456 cache-misses:u # 0,841 % of all cache refs
0,015068004 seconds time elapsed
0,015102000 seconds user
0,000000000 seconds sys
Performance counter stats for './depth_first':
213 178 cache-references:u
5 901 cache-misses:u # 2,768 % of all cache refs
0,026617312 seconds time elapsed
0,026690000 seconds user
0,000000000 seconds sys
Now, what I expected to see was a similar number of cache-references in both, with a larger number/percentage of cache-misses in the breadth_first case. I expected a similar number of references in both, since both versions of the code do the same "thing" and perform the same number of accesses into the 2D array.
Instead, the number of cache-references grew immensely. The number of cache-misses also grew, but the percentage was actually better in the breadth_first case (probably just because of the much larger number of references, so the percentage alone is not indicative of anything).
Now my question is, what causes the increase in cache-references?
Update
I am testing this on an AMD CPU, in case this matters.
I located the mapping in the kernel source from the perf event id to what I assume is a hardware event id. I am now trying to navigate section 2.1.14 of the AMD PPR to find the description of the underlying hardware event. I would still appreciate an informed answer to the question.