
To measure the impact of cache-misses in a program, I want to compare the latency caused by cache-misses to the cycles used for actual computation. I use perf stat to measure the cycles, L1-loads, L1-misses, LLC-loads and LLC-misses in my program. Here is an example output:

               467 769,70 msec task-clock                #    1,000 CPUs utilized          
        1 234 063 672 432      cycles                    #    2,638 GHz                      (62,50%)
          572 761 379 098      instructions              #    0,46  insn per cycle           (75,00%)
          129 143 035 219      branches                  #  276,083 M/sec                    (75,00%)
            6 457 141 079      branch-misses             #    5,00% of all branches          (75,00%)
          195 360 583 052      L1-dcache-loads           #  417,643 M/sec                    (75,00%)
           33 224 066 301      L1-dcache-load-misses     #   17,01% of all L1-dcache hits    (75,00%)
           20 620 655 322      LLC-loads                 #   44,083 M/sec                    (50,00%)
            6 030 530 728      LLC-load-misses           #   29,25% of all LL-cache hits     (50,00%)

Then my question is: How to convert the number of cache-misses into a number of "lost" clock cycles? Or alternatively, what is the proportion of time spent for fetching data?

I think the factor should be published by the manufacturer. My processor is an Intel Core i7-10810U, and I couldn't find this information in the specifications nor in this list of benchmarked CPUs.

This related question describes how to measure the number of cycles lost on a cache-miss, but is there a way to obtain this as hardware information? Ideally, the output would be something like:

L1-hit: 3 cycles
L2-hit: 10 cycles
LLC-hit: 30 cycles
RAM: 300 cycles
1 Answer


Out-of-order exec and memory-level parallelism exist to hide some of that latency by overlapping useful work with the time data is in flight. If you simply multiplied the L3 miss count by, say, 300 cycles each, that could exceed the total number of cycles your whole program took. The perf event cycle_activity.stalls_l3_miss (which exists on my Skylake CPU) should count cycles when no uops execute and there's an outstanding L3 cache miss, i.e. cycles when execution is fully stalled. But there will also be cycles with some work, but less than without a cache miss, and that's harder to evaluate.
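As a sanity check with the counters from the question's perf stat output: naively charging each LLC miss a full DRAM latency (300 cycles is an assumed round number, not a measured one) already exceeds the program's total cycle count, which proves the miss latencies must overlap with each other and with useful work:

```python
# Counters taken from the perf stat output in the question.
total_cycles = 1_234_063_672_432
llc_load_misses = 6_030_530_728

# Assumed ~300 core cycles per DRAM access -- illustrative only.
assumed_dram_latency = 300

naive_stall_cycles = llc_load_misses * assumed_dram_latency
print(f"naive estimate: {naive_stall_cycles:.3e} cycles")
print(f"measured total: {total_cycles:.3e} cycles")
# The "lost cycles" estimate is ~1.47x the total cycles the program ran,
# which is impossible -- the naive model ignores memory-level parallelism.
print(naive_stall_cycles > total_cycles)
```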

TL:DR: memory access is heavily pipelined; the whole core doesn't stop on one cache miss, that's the whole point. A pointer-chasing benchmark (to measure latency) is merely a worst case, where the only work is dereferencing a load result. See Modern Microprocessors A 90-Minute Guide! which has a section about memory and the "memory wall". See also https://agner.org/optimize/ and https://www.realworldtech.com/haswell-cpu/ to learn more about the details of out-of-order exec CPUs and how they can continue doing independent work while one instruction is waiting for data from a cache miss, up to the limit of their out-of-order window size. (https://blog.stuffedcow.net/2013/05/measuring-rob-capacity/)
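The pointer-chasing idea above can be sketched as follows: build a random cyclic permutation and follow it, so every load's address depends on the previous load's result and nothing can overlap. This Python version only illustrates the structure -- interpreter overhead swamps the actual memory latency, so a real latency benchmark would be written in C or asm (as the linked 7-cpu and stuffedcow benchmarks are):

```python
import random
import time

def make_chain(n):
    """Build a random single-cycle permutation: nxt[i] is the next index
    to visit from i. The random order defeats the hardware prefetcher."""
    order = list(range(n))
    random.shuffle(order)
    nxt = [0] * n
    for i in range(n):
        nxt[order[i]] = order[(i + 1) % n]
    return nxt

def chase(nxt, steps):
    """Each step is a dependent load: the next index is the loaded value,
    so the steps serialize instead of overlapping."""
    i = 0
    for _ in range(steps):
        i = nxt[i]
    return i

n = 1 << 20
nxt = make_chain(n)
t0 = time.perf_counter()
chase(nxt, n)
dt = time.perf_counter() - t0
print(f"~{dt / n * 1e9:.0f} ns per dependent step (interpreter-dominated)")
```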


Re: numbers from vendors:

L3 and RAM latencies aren't a fixed number of core clock cycles: first because core frequency is variable (and independent of the uncore and memory clocks), and second because of contention from other cores and the number of hops over the interconnect. (Related: Is cycle count itself reliable on program timing? discusses some effects of core frequency independent of L3 and memory)
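To make the first point concrete: if you assume a fixed ~90 ns wall-clock DRAM latency (an illustrative number, not a spec for this CPU), the same access costs a very different number of core cycles depending on where the core clock happens to be:

```python
dram_latency_ns = 90  # assumed fixed wall-clock DRAM latency, for illustration

# e.g. a low-power clock, the question's measured average, and a turbo clock
for ghz in (1.0, 2.638, 4.9):
    cycles = dram_latency_ns * ghz   # ns * (cycles per ns) = core cycles
    print(f"at {ghz:.3f} GHz: ~{cycles:.0f} core cycles per DRAM access")
```

So a single "cycles per miss" factor from the vendor can't exist for the outer levels of the hierarchy; only L1 and L2, which run on the core clock, have fixed cycle latencies.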

That said, Intel's optimization manual does include a table with exact latencies for L1 and L2, and typical for L3, and DRAM on Skylake-server. (2.2.1.3 Skylake Server Microarchitecture Cache Recommendations) https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html#optimization - they say SKX L3 latency is typically 50-70 cycles. DRAM speed depends some on the timing of your DIMMs.

Other people have tested specific CPUs, like https://www.7-cpu.com/cpu/Skylake.html.

Peter Cordes
  • Thanks a lot, the Intel hardware perf events are useful (though quite hard to use). Is it fair to assume that all the Skylake processors have the same pipeline system? – Vendec Jan 14 '21 at 17:11
  • @Vendec: yes, the basic architecture of the pipeline is identical even between Skylake-server and -client. L2 cache is a different size, associativity, and latency, and SKX has AVX512 enabled, but everything else is the same. – Peter Cordes Jan 14 '21 at 17:26