Can perf account for all cache misses?

Question

I'm trying to understand the cache misses recorded by perf. I have a minimal program:

int main(void)
{
    return 0;
}

If I compile this:

gcc -std=c99 -W -Wall -Werror -O3 -S -o test.S test.c

I get an expectedly small program:

        .file   "test.c"
        .section        .text.startup,"ax",@progbits
        .p2align 4,,15
        .globl  main
        .type   main, @function
main:
.LFB0:
        .cfi_startproc
        xorl    %eax, %eax
        ret
        .cfi_endproc
.LFE0:
        .size   main, .-main
        .ident  "GCC: (Debian 4.7.2-5) 4.7.2"
        .section        .note.GNU-stack,"",@progbits

With only two instructions, xorl and ret, the program should be less than a cache line in size, so I would expect that if I run perf stat -e "cache-misses:u" ./test I should see only a single cache miss. However, I instead see between 2 and ~400. Similarly, perf stat -e "cache-misses" ./test results in ~700 to ~2500.

Is this simply a case of perf estimating counts or is there something about the way cache misses occur that makes reasoning about them approximate? For example, if I generate and then read an array of integers in memory, can I reason about the prefetching (sequential access should allow for perfect prefetching) or is there something else at play?

You created a `main` instead of `_start`, and probably built it into a dynamically-linked executable!! So there's all the CRT startup code, initializing libc, and several system calls. Run `strace ./test`. What would be more interesting is a statically linked executable that just makes an `_exit(0)` system call with the `syscall` instruction, from the `_start` entry point. — Peter Cordes, Sep 27 '19 at 09:32

score 5 · Answer 1 · answered Sep 27 '19 at 18:15

You created a main instead of _start, and probably built it into a dynamically-linked executable! So there's all the CRT startup code, initializing libc, and several system calls. Run strace ./test and see how many system calls it's making. (And of course there's lots of work in user-space that doesn't involve system calls.)

What would be more interesting is a statically linked executable that just makes an _exit(0) or exit_group(0) system call with the syscall instruction, from the _start entry point.

Given an exit.s with these contents:

mov $231, %eax
syscall

build it into a static executable so these two instructions are the only ones executed in user-space:

$ gcc -static -nostdlib exit.s
/usr/bin/ld: warning: cannot find entry symbol _start; defaulting to 0000000000401000
  # the default is fine, our instructions are at the start of the .text section

$ perf stat -e cache-misses:u ./a.out 

 Performance counter stats for './a.out':

                 6      cache-misses:u                                              

       0.000345362 seconds time elapsed

       0.000382000 seconds user
       0.000000000 seconds sys

I told it to count cache-misses:u to only measure user-space cache misses, instead of everything on the core the process was running on. (That would include kernel cache misses before entering user-space and while handling the exit_group() system call. And potentially interrupt handlers).

(There is hardware support in the PMU for events to count when the privilege level is user, kernel, or both. So we should expect counts to be off by at most 1 or 2 from counting stuff done during the transition from kernel->user or user->kernel, e.g. changing CS, potentially resulting in a load from the GDT of the segment descriptor indexed by the new CS value.)
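To make the user-only filter concrete outside the perf tool, here is a minimal sketch, assuming Linux with the kernel headers installed, of how a program could request the generic cache-misses event with kernel-mode counting excluded. This is roughly what the :u modifier asks for. The helper names (user_cache_miss_attr, open_counter, count_user_cache_misses) are made up for illustration; perf_event_open, PERF_COUNT_HW_CACHE_MISSES and exclude_kernel are the kernel's own interface.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: describe the generic cache-misses event,
 * counted only while the CPU is in user mode. */
struct perf_event_attr user_cache_miss_attr(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof attr;
    attr.config = PERF_COUNT_HW_CACHE_MISSES;
    attr.disabled = 1;        /* start stopped; enabled explicitly below */
    attr.exclude_kernel = 1;  /* do not count in kernel mode */
    attr.exclude_hv = 1;      /* do not count in the hypervisor */
    return attr;
}

/* glibc provides no wrapper for perf_event_open, so use syscall(2).
 * pid 0 = calling thread, cpu -1 = any CPU, group_fd -1 = no group. */
int open_counter(struct perf_event_attr *attr)
{
    return (int)syscall(SYS_perf_event_open, attr, 0, -1, -1, 0UL);
}

/* Hypothetical helper: count events while fn(arg) runs.
 * Returns -1 if the counter could not be opened or read
 * (for example when perf_event_paranoid forbids it). */
long long count_user_cache_misses(void (*fn)(void *), void *arg)
{
    struct perf_event_attr attr = user_cache_miss_attr();
    int fd = open_counter(&attr);
    if (fd < 0)
        return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    fn(arg);
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count = 0;
    ssize_t n = read(fd, &count, sizeof count);
    close(fd);
    return n == (ssize_t)sizeof count ? (long long)count : -1;
}
```

Counting only around fn(arg) also leaves out the process startup work, which perf stat on a whole executable cannot do.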

But what event does `cache-misses` actually count?

The Q&A "How does Linux perf calculate the cache-references and cache-misses events" explains:

perf apparently maps cache-misses to a HW event that counts last-level cache misses. So it's something like the number of DRAM accesses.

Multiple attempts to access the same line in L1d or L1i cache while an L1 miss is already outstanding just adds another thing waiting for the same incoming cache line. So it's not counting loads (or code-fetch) that have to wait for cache. Multiple loads can coalesce into one access.

But also remember that code-fetch needs to go through the iTLB, triggering a page-walk. Page-walk loads are cached, i.e. they're fetched through the cache hierarchy. So they're counted by the cache-misses event if they do miss.

Repeated runs of the program can result in 0 cache-miss events. The executable binary is a file, and the file is cached by the OS's pagecache (disk cache). That physical memory is mapped into the address-space of the process running it, so it can certainly stay hot in L3 across process start/stop. More interesting is that apparently the page tables stay hot, too. (Not literally "stay" hot; I assume the kernel has to write new ones every time. But presumably the page-walker is hitting at least in L3 cache.)

Or at least whatever else was causing the "extra" cache-miss events doesn't have to happen.

I used perf stat -r 16 to run it 16 times and show mean ± stddev:

$ perf stat -e instructions:u,L1-dcache-loads:u,L1-dcache-load-misses:u,cache-misses:u,itlb_misses.walk_completed:u -r 16 ./exit

 Performance counter stats for './exit' (16 runs):

                 3      instructions:u                                              
                 1      L1-dcache-loads                                             
                 5      L1-dcache-load-misses     #  506.25% of all L1-dcache hits    ( +-  6.37% )
                 1      cache-misses:u                                                ( +-100.00% )
                 2      itlb_misses.walk_completed:u                                   

         0.0001422 +- 0.0000108 seconds time elapsed  ( +-  7.57% )

Note the +-100% on cache-misses.

I don't know why we have 2 itlb_misses.walk_completed events, not just 1. Counting itlb_misses.miss_causes_a_walk:u instead gives us 4 consistently.

Reducing to -r 1 and running repeatedly with manual up-arrow, cache-misses bounces around between 3 and 13. The system is mostly idle but with a bit of background network traffic.

I also don't know why anything is showing as an L1D load, or how there can be 6 misses from one load. But Hadi's answer says that perf's L1-dcache-load-misses event actually counts L1D.REPLACEMENT, so the page-walks could account for that. While L1-dcache-loads counts MEM_INST_RETIRED.ALL_LOADS. mov-immediate isn't a load, and I wouldn't have thought syscall is either. But maybe it is, otherwise the HW is falsely counting a kernel instruction or there's an off-by-1 somewhere.

score 4 · Answer 2 · answered Oct 23 '19 at 23:14

This is not an easy topic, but if you are interested in counting cache misses from (for example) accessing an array, then that is what you should start with.

There are numerous pitfalls, but the simplest approach that is likely to lead to insight would start with a program that allocates an array, stores values into the array, and then reads the array a programmable number of times.

Storing values into the array is necessary to create the virtual to physical page mappings. The performance counter results for this section are likely to be incomprehensible because of the tricks that the OS uses in initializing these pages -- e.g., starting with a mapping to a zero-filled page and setting the access to "copy on write".

After the pages are instantiated, the performance counts for the reads are likely to make a lot more sense. I use a programmable number of reads so that I can take the differences between the counter values for 20 reads and 10 reads (for example). The array size should be chosen to be significantly larger than the available cache at the level you want to test.
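The experiment described above can be sketched in C roughly as follows. This is a minimal illustration under assumptions, not the author's actual test program: the function names are hypothetical, the element type is arbitrary, and the volatile read is only there to stop the compiler from collapsing the repeated passes.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate n elements and store to every one of them, so the OS has to
 * instantiate the virtual-to-physical mappings before any timed reads. */
uint64_t *make_array(size_t n)
{
    uint64_t *a = malloc(n * sizeof *a);
    if (a == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++)
        a[i] = i;
    return a;
}

/* Read the whole array sequentially `passes` times and return the sum.
 * Reading through a volatile pointer forces every load to be performed. */
uint64_t read_passes(const uint64_t *a, size_t n, unsigned passes)
{
    const volatile uint64_t *p = a;
    uint64_t sum = 0;
    for (unsigned r = 0; r < passes; r++)
        for (size_t i = 0; i < n; i++)
            sum += p[i];
    return sum;
}

/* Events per read pass, from counter values taken in two runs that differ
 * only in the number of passes. Startup and page-instantiation costs are
 * common to both runs and cancel in the difference. */
double events_per_pass(uint64_t count_hi, uint64_t count_lo,
                       unsigned passes_hi, unsigned passes_lo)
{
    return ((double)count_hi - (double)count_lo)
         / (double)(passes_hi - passes_lo);
}
```

A driver would take the pass count from the command line, call make_array once and then read_passes, and be run under perf stat twice (say with 10 and 20 passes); events_per_pass then gives the counts attributable to one sweep of the array.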

Unfortunately, "perf" makes it relatively difficult to figure out what is actually being programmed into the performance counters at the hardware level (which is the only level that counts!). The more "generic" the event, the harder it is to guess what is actually being measured.... On my recent Intel-based systems, "perf list" gives a long (>3600 lines) listing of available events. The events starting in the section labelled "cache:" are direct translations of the hardware events that are described in Chapter 19 of Volume 3 of the Intel Architectures Software Developers Manual.

You are correct to be concerned about how hardware prefetches are counted. In recent Intel architectures, events that report cache accesses can typically be configured to count demand accesses, hardware prefetches, or both. Events that report source locations for load instructions won't give any insight into where the HW prefetch found the data -- only how close to the processor it had gotten by the time the load operation executed.

I have found the event "l1d.replacement" to be a reliable L1 Data Cache Miss indicator on recent Intel processors. It simply counts all cache lines moved into the L1 Data Cache (whether due to loads, stores, prefetches, etc). At the other end of the hierarchy, the DRAM counters (e.g., "uncore_imc_0/cas_count_read/") are also reliable, but are subject to contamination due to any other activity in the system. Counters for "two-sided" caches (e.g., L2 & L3) are more likely to be confusing because it is not always clear whether the event is counting cache lines sent in from one side or the other or both (e.g., "l2_lines_in.all"). With some carefully controlled experiments, it is usually possible to find a subset of reliable & understandable events at these intermediate levels. It is not always possible to find enough reliable counters to make a full accounting of all traffic at each level of the memory hierarchy, but that is a longer story....

score -1 · Answer 3 · edited Jun 20 '20 at 09:12

The process memory space is not only about your code; other regions such as the heap, the stack, and the data segment will also contribute to the cache misses.

(Image source: tenouk.com)

I don't think you can estimate cache-miss numbers, just like you cannot predict the running sequence of every thread in a multithreaded program.

However, cache-miss analysis is useful for finding and targeting false sharing. Here are some useful links you can refer to:

answered Apr 27 '15 at 17:03 by qqibrow

This is a single-threaded program; there's no sharing of anything. The OP's guess of about 1 cache miss would have been reasonable if they'd actually built a static executable that just made an exit system call. But instead they still have the CRT code initializing libc and so on. – Peter Cordes Sep 27 '19 at 09:34
