I am trying to identify the bottleneck in my code using perf and ocperf. When I do a 'detailed stat' run on my binary, two statistics are reported in red text, which I take to mean they are too high:
L1-dcache-load-misses is in red at 28.60%
iTLB-load-misses is in red at 425.89%
# ~bram/src/pmu-tools/ocperf.py stat -d -d -d -d -d ./bench ray
perf stat -d -d -d -d -d ./bench ray
Loaded 455 primitives.
Testing ray against 455 primitives.

 Performance counter stats for './bench ray':

       9031.444612      task-clock (msec)         #    1.000 CPUs utilized
                15      context-switches          #    0.002 K/sec
                 0      cpu-migrations            #    0.000 K/sec
               292      page-faults               #    0.032 K/sec
    28,786,063,163      cycles                    #    3.187 GHz                      (61.47%)
   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
    55,742,952,563      instructions              #    1.94  insns per cycle          (69.18%)
     3,717,242,560      branches                  #  411.589 M/sec                    (69.18%)
        18,097,580      branch-misses             #    0.49% of all branches          (69.18%)
    10,230,376,136      L1-dcache-loads           # 1132.751 M/sec                    (69.17%)
     2,926,349,754      L1-dcache-load-misses     #   28.60% of all L1-dcache hits    (69.21%)
       145,843,523      LLC-loads                 #   16.148 M/sec                    (69.32%)
            49,512      LLC-load-misses           #    0.07% of all LL-cache hits     (69.33%)
   <not supported>      L1-icache-loads
           260,144      L1-icache-load-misses     #    0.029 M/sec                    (69.34%)
    10,230,376,830      dTLB-loads                # 1132.751 M/sec                    (69.34%)
             1,197      dTLB-load-misses          #    0.00% of all dTLB cache hits   (61.59%)
             2,294      iTLB-loads                #    0.254 K/sec                    (61.55%)
             9,770      iTLB-load-misses          #  425.89% of all iTLB cache hits   (61.51%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

       9.032234014 seconds time elapsed
My questions:
- What would be a reasonable figure for L1 data cache misses?
- What would be a reasonable figure for iTLB-load-misses?
- Why can iTLB-load-misses exceed 100%? In other words, why does iTLB-load-misses exceed iTLB-loads? I've even seen it spike as high as 568%. (See the sketch after this list for how the printed ratio works out.)
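To make the third question concrete, here is a small Python sketch (my own sanity check, not anything perf itself provides) that recomputes the two highlighted percentages from the raw counts above, assuming the percentage perf prints next to a *-load-misses line is simply misses divided by the corresponding *-loads count:

# Recompute the two red percentages from the raw counters in the output above.
counts = {
    "L1-dcache-loads": 10_230_376_136,
    "L1-dcache-load-misses": 2_926_349_754,
    "iTLB-loads": 2_294,
    "iTLB-load-misses": 9_770,
}

for event in ("L1-dcache", "iTLB"):
    loads = counts[f"{event}-loads"]          # denominator: the *-loads counter
    misses = counts[f"{event}-load-misses"]   # numerator: the *-load-misses counter
    print(f"{event}: {misses / loads * 100:.2f}% of loads")

This prints 28.60% and 425.89%, matching the red figures, so the ratio appears to be misses/loads rather than misses over total accesses; it goes above 100% as soon as the miss counter is larger than the load counter, which is exactly the part I don't understand.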
Also, my machine has a Haswell CPU, so I would have expected the stalled-cycles stats to be included. Why do stalled-cycles-frontend and stalled-cycles-backend show up as <not supported>?