You created a main instead of _start, and probably built it into a dynamically linked executable. So there's all the CRT startup code, libc initialization, and several system calls. Run strace ./test and see how many system calls it's making. (And of course there's a lot of work in user-space that doesn't involve system calls.)
What would be more interesting is a statically linked executable that just makes an _exit(0) or exit_group(0) system call with the syscall instruction, from the _start entry point.
Given an exit.s with these contents:
mov $231, %eax    # 231 = __NR_exit_group on x86-64; %edi (the exit status) is not set here
syscall
build it into a static executable so these two instructions are the only ones executed in user-space:
$ gcc -static -nostdlib exit.s
/usr/bin/ld: warning: cannot find entry symbol _start; defaulting to 0000000000401000
# the default is fine, our instructions are at the start of the .text section
$ perf stat -e cache-misses:u ./a.out
Performance counter stats for './a.out':
6 cache-misses:u
0.000345362 seconds time elapsed
0.000382000 seconds user
0.000000000 seconds sys
I told it to count cache-misses:u to only measure user-space cache misses, instead of everything on the core the process was running on. (That would include kernel cache misses before entering user-space and while handling the exit_group() system call, and potentially interrupt handlers.)
(There is hardware support in the PMU for events to count when the privilege level is user, kernel, or both. So we should expect counts to be off by at most 1 or 2 from counting stuff done during the transition from kernel->user or user->kernel: changing CS, potentially resulting in a load from the GDT of the segment descriptor indexed by the new CS value.)
But what event does cache-misses actually count?
"How does Linux perf calculate the cache-references and cache-misses events" explains:
perf apparently maps cache-misses to a HW event that counts last-level cache misses. So it's something like the number of DRAM accesses.
Multiple attempts to access the same line in L1d or L1i cache while an L1 miss is already outstanding just add another thing waiting for the same incoming cache line. So it's not counting loads (or code-fetches) that have to wait for cache.
Multiple loads can coalesce into one access.
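As a rough illustration of that coalescing, here is a toy Python model (not how any real cache or PMU is implemented). Accesses that arrive while a miss for the same line is still outstanding attach to the existing request instead of counting as a new miss. The 64-byte line size, the fixed latency, the absence of evictions, and all names are assumptions made for the sketch.

```python
LINE_SIZE = 64       # bytes per cache line (assumed; typical for x86)
MISS_LATENCY = 100   # cycles until a missing line arrives (made-up number)

def count_misses(accesses):
    """accesses: iterable of (cycle, address), sorted by cycle.

    Returns (misses, coalesced, hits): new line requests sent out,
    accesses that attached to an in-flight request, and accesses to
    lines that had already arrived. Evictions are not modeled.
    """
    arrival = {}                  # line number -> cycle at which its data arrives
    misses = coalesced = hits = 0
    for cycle, addr in accesses:
        line = addr // LINE_SIZE
        if line not in arrival:
            arrival[line] = cycle + MISS_LATENCY   # first touch: a real miss
            misses += 1
        elif cycle < arrival[line]:
            coalesced += 1        # waits on the in-flight request, no new miss
        else:
            hits += 1             # line has already arrived: plain hit
    return misses, coalesced, hits

# Three accesses to one line in quick succession, one to another line,
# then a late access to the first line.
trace = [(0, 0x401000), (1, 0x401008), (2, 0x401010), (3, 0x402000), (500, 0x401020)]
print(count_misses(trace))   # (2, 2, 1)
```

Five accesses produce only two miss events in this model, which is the sense in which a miss counter undercounts "accesses that had to wait".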
But also remember that code-fetch needs to go through the iTLB, and an iTLB miss triggers a page-walk. Page-walk loads are cached, i.e. they're fetched through the cache hierarchy, so they're counted by the cache-misses event if they do miss.
Repeated runs of the program can result in 0 cache-miss events. The executable binary is a file, and the file is cached by the pagecache (the OS's disk cache). That physical memory is mapped into the address-space of the process running it. It can certainly stay hot in L3 across process start/stop. More interesting is that apparently the page-table stays hot, too. (Not literally "stays" hot; I assume the kernel has to write a new one every time. But presumably the page-walker is hitting at least in L3 cache.)
Or at least whatever else was causing the "extra" cache-miss events doesn't have to happen.
I used perf stat -r16 to run it 16 times and show mean ± stddev:
$ perf stat -e instructions:u,L1-dcache-loads:u,L1-dcache-load-misses:u,cache-misses:u,itlb_misses.walk_completed:u -r 16 ./exit
Performance counter stats for './exit' (16 runs):
3 instructions:u
1 L1-dcache-loads
5 L1-dcache-load-misses # 506.25% of all L1-dcache hits ( +- 6.37% )
1 cache-misses:u ( +-100.00% )
2 itlb_misses.walk_completed:u
0.0001422 +- 0.0000108 seconds time elapsed ( +- 7.57% )
Note the +-100% on cache-misses.
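The 506.25% figure on the L1-dcache-load-misses line is easier to read as a ratio. Assuming perf derives it as mean misses divided by mean loads (which is how the output reads, not something verified against perf's source), and assuming the load count was exactly 1 in each of the 16 runs, the arithmetic works out like this:

```python
from fractions import Fraction

RUNS = 16
mean_loads = Fraction(1)            # "1 L1-dcache-loads" (assumed exact in every run)
ratio = Fraction("506.25") / 100    # "506.25% of all L1-dcache hits"

mean_misses = ratio * mean_loads    # average misses per run
total_misses = mean_misses * RUNS   # summed over all 16 runs

print(float(mean_misses))   # 5.0625
print(total_misses)         # 81
```

So under these assumptions the displayed 5 is a rounded mean, and the runs were not all identical (hence the +- 6.37%).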
I don't know why we have 2 itlb_misses.walk_completed events, not just 1. Counting itlb_misses.miss_causes_a_walk:u instead gives us 4 consistently.
Reducing to -r 1 and running repeatedly with manual up-arrow, cache-misses bounces around between 3 and 13. The system is mostly idle but with a bit of background network traffic.
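If you'd rather not use the up-arrow, a small script can collect the per-run counts. This is only a sketch: it assumes perf is on PATH, that the binary is ./exit, and that perf stat's -x, CSV mode puts the count in the first field and the event name in a later one. The function names are made up for the example.

```python
import shutil
import subprocess

def parse_count(csv_text, event):
    """Return the integer count for `event` from `perf stat -x,` output, or None."""
    for line in csv_text.splitlines():
        fields = line.split(",")
        # Assumed layout: count first, event name in a later field.
        if event in fields[1:] and fields[0].isdigit():
            return int(fields[0])
    return None   # event missing, or reported as <not counted> / <not supported>

def sample(binary="./exit", event="cache-misses:u", runs=20):
    """Run `binary` under perf stat `runs` times; return one count per successful run."""
    counts = []
    for _ in range(runs):
        # perf stat writes its counter report to stderr by default.
        result = subprocess.run(
            ["perf", "stat", "-x,", "-e", event, binary],
            stderr=subprocess.PIPE, stdout=subprocess.DEVNULL, text=True,
        )
        count = parse_count(result.stderr, event)
        if count is not None:
            counts.append(count)
    return counts

if __name__ == "__main__":
    if shutil.which("perf") is None:
        print("perf not found on PATH")
    else:
        counts = sample()
        if counts:
            print(f"min={min(counts)} max={max(counts)} all={sorted(counts)}")
```

Running each sample as a separate perf invocation, rather than using -r, keeps the individual values visible instead of only the mean.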
I also don't know why anything is showing as an L1D load, or how there can be 5 or 6 misses from one load. But Hadi's answer says that perf's L1-dcache-load-misses event actually counts L1D.REPLACEMENT, so the page-walks could account for that, while L1-dcache-loads counts MEM_INST_RETIRED.ALL_LOADS. mov-immediate isn't a load, and I wouldn't have thought syscall is either. But maybe it is; otherwise the HW is falsely counting a kernel instruction, or there's an off-by-1 somewhere.