`perf stat` has some named "metrics" that it knows how to calculate from other events. According to `perf list` on my system, those include `L3_Cache_Access_BW` and `L3_Cache_Fill_BW`:
- `L3_Cache_Access_BW` [Average per-core data access bandwidth to the L3 cache [GB / sec]]
- `L3_Cache_Fill_BW` [Average per-core data fill bandwidth to the L3 cache [GB / sec]]
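To check what's available on your own machine, grepping the `perf list` output works; the pattern here is just an example:

```
$ perf list | grep -A1 L3_Cache
```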
This is from my system with a Skylake (i7-6700K). Other CPUs (especially from other vendors or architectures) might have different support for these metrics, or might not support them at all.
I tried it out on a simplistic sieve of Eratosthenes (using a bool array, not a bitmap) from a recent codereview question, since I had a benchmarkable version of it (with a repeat loop) lying around. It measured 52 GB/s total bandwidth (read+write, I think). The n=4000000 problem size I used thus takes 4 MB (one byte per element), larger than the 256 KiB L2 size but smaller than the 8 MiB L3 size.
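For reference, the benchmark was shaped roughly like the sketch below. This is not the actual code from the codereview question, just a minimal stand-in with the same structure: a sieve over a plain bool array with an outer repeat loop, reading n from stdin. The repeat count is arbitrary.

```c++
// Minimal sketch of this kind of benchmark; NOT the actual code from
// the codereview question.  Sieve of Eratosthenes over a plain byte
// array (one byte per element, not a bitmap), with an outer repeat
// loop so the run is long enough for perf to measure.
#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    long n;
    if (scanf("%ld", &n) != 1)          // e.g.  echo 4000000 | ./sieve
        return 1;

    std::vector<char> composite(n + 1); // allocated once, reused every rep
    long total = 0;
    for (int rep = 0; rep < 1000; rep++) {   // arbitrary repeat count
        std::memset(composite.data(), 0, composite.size());
        for (long i = 2; i * i <= n; i++)
            if (!composite[i])
                for (long j = i * i; j <= n; j += i)
                    composite[j] = 1;
        long count = 0;                      // count primes up to n
        for (long i = 2; i <= n; i++)
            count += !composite[i];
        total += count;
    }
    printf("%ld\n", total);  // use the result so the work isn't optimized away
    return 0;
}
```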
```
$ echo 4000000 |
    taskset -c 3 perf stat --all-user -M L3_Cache_Access_BW \
    -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions ./sieve

 Performance counter stats for './sieve-buggy':

     7,711,201,973      offcore_requests.all_requests  #  816.916 M/sec
                                                       #    52.27 L3_Cache_Access_BW
     9,441,504,472 ns   duration_time                  #    1.000 G/sec
          9,439.41 msec task-clock                     #    1.000 CPUs utilized
                 0      context-switches               #    0.000 /sec
                 0      cpu-migrations                 #    0.000 /sec
             1,020      page-faults                    #  108.058 /sec
    38,736,147,765      cycles                         #    4.104 GHz
    53,699,139,784      instructions                   #    1.39  insn per cycle

       9.441504472 seconds time elapsed

       9.432262000 seconds user
       0.000000000 seconds sys
```
Or with just `-M L3_Cache_Access_BW` and no `-e` events, it just shows `offcore_requests.all_requests  #  54.52 L3_Cache_Access_BW` and `duration_time`. So it overrides the default event set and doesn't count `cycles`, `instructions` and so on.
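That invocation would look like this (same input and pinning as above):

```
$ echo 4000000 | taskset -c 3 perf stat --all-user -M L3_Cache_Access_BW ./sieve
```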
I think it's just counting all off-core requests by this core, assuming (correctly) that each one involves a 64-byte transfer. A request is counted whether it hits or misses in L3 cache. Getting mostly L3 hits obviously enables a higher bandwidth than if the uncore bottlenecks on the DRAM controllers instead.
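That guess is consistent with the run above: assuming the metric is computed as `offcore_requests.all_requests` × 64 bytes divided by `duration_time` (in seconds), the run's own counts reproduce perf's 52.27 figure:

```
$ python3 -c 'print(7711201973 * 64 / 9.441504472 / 1e9)'   # GB/s
52.271...
```

That's about 493.5 GB transferred over 9.44 seconds.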