`L1-dcache-misses` is the fraction of all loads that miss in L1d cache.
L2-misses is the fraction of requests that make it to L2 at all (i.e. miss in L1) and then miss in L2. Similarly for L3. An L1d hit isn't part of the total L2 accesses (which makes sense, because L2 never even sees it).
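For example (a quick sketch with made-up counts, not numbers from any real run), the same raw event totals give very different percentages depending on which denominator you divide by:

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical event counts, purely illustrative. */
    long loads   = 1000000;  /* all retired loads                     */
    long l1_miss =   50000;  /* loads that miss L1d, so they reach L2 */
    long l2_miss =   40000;  /* of those, the ones that also miss L2  */

    printf("L1d miss rate (of all loads):   %4.1f%%\n", 100.0 * l1_miss / loads);
    printf("L2  miss rate (of L2 accesses): %4.1f%%\n", 100.0 * l2_miss / l1_miss);
    printf("L2  miss rate (of all loads):   %4.1f%%\n", 100.0 * l2_miss / loads);
    return 0;
}
```

An 80% L2 miss rate sounds alarming in isolation, but with these numbers it's still only 4% of all loads going past L2.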
This is pretty normal for a workload with good locality over a small working set, but the accesses that miss in L1d have poor spatio-temporal locality and tend to miss in outer caches as well.
L1d filters out all the "easy" very-high-locality accesses, leaving L2 and L3 to only deal with the "harder" accesses. You can say that L1d exists to give excellent latency (and bandwidth) for the smallest hottest working set, while L2 tries to catch stuff that falls through the cracks. Then L3 only sees the "most difficult" parts of your access pattern.
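If you want to see this effect in isolation, a toy loop like the following (my own sketch, not the code from the question) should show it under `perf stat -d`: the hot buffer keeps the overall L1d miss rate low, while the loads that do miss are random over a buffer much bigger than L3 and so tend to miss at every level:

```c
#include <stdint.h>
#include <stdlib.h>

int main(void) {
    enum { HOT = 4096, COLD = 64 * 1024 * 1024 };   /* 4 KiB vs. 64 MiB */
    uint8_t *hot  = calloc(HOT, 1);
    uint8_t *cold = calloc(COLD, 1);
    uint64_t sum = 0, rng = 1;

    for (long i = 0; i < 200000000; i++) {
        sum += hot[i & (HOT - 1)];            /* almost always an L1d hit */
        if ((i & 15) == 0) {                  /* 1 in 16 loads is "hard": */
            rng = rng * 6364136223846793005ULL + 1;   /* cheap LCG        */
            sum += cold[(rng >> 32) % COLD];  /* random, misses L1/L2/L3  */
        }
    }
    free(hot); free(cold);
    return (int)(sum & 0xff);    /* keep the loads from optimizing away */
}
```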
Also, if you're on an Intel CPU, note that perf doesn't just use `mem_load_retired.l1_miss` events and so on; it tries to count multiple misses to the same line of L1d as a single miss, by using the `L1D.REPLACEMENT` event. `LLC-loads` and `LLC-load-misses` use `OFFCORE_RESPONSE` events, not `mem_load_retired.l3_hit` / `mem_load_retired.l3_miss`. See *How does Linux perf calculate the cache-references and cache-misses events*.
(Two loads to the same cache line that isn't ready yet will share the same LFB to track the incoming line, so this accounting makes sense, and it's also what we want if we care about lines touched / missed rather than individual loads. But `L1-dcache-loads` uses `MEM_INST_RETIRED.ALL_LOADS`, which does count every load. So even the perf-reported L1 hit rate isn't really the per-instruction L1d load hit rate; it will be higher for any program with spatial locality in its L1d misses.)
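To put a rough number on that last point, here's a minimal sketch (the buffer size, iteration count, and LCG are arbitrary choices of mine, and the raw event list in the comment is Intel-specific): each randomly chosen cold line is loaded from twice back to back, so you'd expect roughly one `L1D.REPLACEMENT` per iteration but two loads per iteration whose data didn't come from L1d (the second of each pair hits the in-flight LFB, showing up as `mem_load_retired.fb_hit` rather than `.l1_miss`). The hit rate perf computes from `L1D.REPLACEMENT` therefore looks about twice as good as the per-load reality:

```c
/* Try something like:
 *   perf stat -e l1d.replacement,mem_load_retired.l1_miss,mem_load_retired.fb_hit,mem_inst_retired.all_loads ./a.out
 */
#include <stdint.h>
#include <stdlib.h>

int main(void) {
    enum { N = 16 * 1024 * 1024 };      /* 128 MiB of uint64_t: far beyond L3 */
    uint64_t *buf = calloc(N, sizeof *buf);
    uint64_t sum = 0, rng = 1;

    for (long i = 0; i < 50000000; i++) {
        rng = rng * 6364136223846793005ULL + 1;     /* LCG defeats prefetch      */
        size_t line = ((rng >> 33) % (N / 8)) * 8;  /* pick a random 64-byte line */
        sum += buf[line];               /* misses L1d, allocates an LFB           */
        sum += buf[line + 1];           /* same line: shares that LFB (fb_hit)    */
    }
    free(buf);
    return (int)(sum & 0xff);
}
```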