To measure the impact of cache misses in a program, I want to compare the latency caused by cache misses to the cycles used for actual computation.
I use perf stat
to measure the cycles, L1-loads, L1-misses, LLC-loads and LLC-misses in my program. Here is an example output:
467 769,70 msec task-clock # 1,000 CPUs utilized
1 234 063 672 432 cycles # 2,638 GHz (62,50%)
572 761 379 098 instructions # 0,46 insn per cycle (75,00%)
129 143 035 219 branches # 276,083 M/sec (75,00%)
6 457 141 079 branch-misses # 5,00% of all branches (75,00%)
195 360 583 052 L1-dcache-loads # 417,643 M/sec (75,00%)
33 224 066 301 L1-dcache-load-misses # 17,01% of all L1-dcache hits (75,00%)
20 620 655 322 LLC-loads # 44,083 M/sec (50,00%)
6 030 530 728 LLC-load-misses # 29,25% of all LL-cache hits (50,00%)
My question is: how do I convert the number of cache misses into a number of "lost" clock cycles? Or alternatively, what proportion of the time is spent fetching data?
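To make the question concrete, this is the kind of back-of-envelope estimate I have in mind (a rough Python sketch using the counts above; the penalty values are pure guesses, and they are exactly the numbers I am missing):

# Back-of-envelope estimate of cycles lost to cache misses.
# All counts come from the perf stat output above; the *_PENALTY values are
# placeholders, NOT real numbers for the i7-10810U.
cycles     = 1_234_063_672_432   # cycles
l1_misses  = 33_224_066_301      # L1-dcache-load-misses (served by L2, LLC or RAM)
llc_loads  = 20_620_655_322      # LLC-loads (loads that missed L2)
llc_misses = 6_030_530_728       # LLC-load-misses (loads that went to RAM)

L2_PENALTY, LLC_PENALTY, RAM_PENALTY = 10, 30, 300   # guessed extra cycles per load

l2_hits  = l1_misses - llc_loads      # roughly: L1 misses served by L2
llc_hits = llc_loads - llc_misses     # roughly: loads served by the LLC

lost_cycles = (l2_hits * L2_PENALTY
               + llc_hits * LLC_PENALTY
               + llc_misses * RAM_PENALTY)

# Assumes misses never overlap with computation or with each other,
# so this is a crude upper bound rather than a measurement.
print(f"estimated lost cycles: {lost_cycles:,}")
print(f"fraction of total cycles: {lost_cycles / cycles:.1%}")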
I think these latencies should be known to the manufacturer. My processor is an Intel Core i7-10810U, and I couldn't find this information in the specifications or in this list of benchmarked CPUs.
This related question describes how to measure the number of cycles lost to a cache miss, but is there a way to obtain these latencies as published hardware information? Ideally, the output would be something like:
L1-hit: 3 cycles
L2-hit: 10 cycles
LLC-hit: 30 cycles
RAM: 300 cycles
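With such a table, I imagine each miss penalty would simply be the difference between consecutive hit latencies, which is what the guessed penalties in the sketch above would be replaced with (again using the made-up values from the wished-for output):

# Hypothetical hit latencies, taken from the ideal output above
L1_HIT, L2_HIT, LLC_HIT, RAM = 3, 10, 30, 300

# Extra cycles a load pays compared to hitting one level higher
l1_miss_penalty  = L2_HIT  - L1_HIT    # L1 miss served by L2
l2_miss_penalty  = LLC_HIT - L2_HIT    # L2 miss served by the LLC
llc_miss_penalty = RAM     - LLC_HIT   # LLC miss served by RAM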