`L1-dcache-misses` is the fraction of all loads that miss in L1d cache.
L2-misses is the fraction of requests that make it to L2 at all (i.e. miss in L1) and then miss in L2. Similarly for L3. An L1d hit isn't part of the total L2 accesses (which makes sense, because L2 never even sees it).
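For example (a quick sketch with made-up counts, not numbers from any real run), the same raw event totals give very different percentages depending on which denominator you divide by:

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical event counts, purely illustrative. */
    long loads   = 1000000;  /* all retired loads                     */
    long l1_miss =   50000;  /* loads that miss L1d, so they reach L2 */
    long l2_miss =   40000;  /* of those, the ones that also miss L2  */

    printf("L1d miss rate (of all loads):   %4.1f%%\n", 100.0 * l1_miss / loads);
    printf("L2  miss rate (of L2 accesses): %4.1f%%\n", 100.0 * l2_miss / l1_miss);
    printf("L2  miss rate (of all loads):   %4.1f%%\n", 100.0 * l2_miss / loads);
    return 0;
}
```

An 80% L2 miss rate sounds alarming in isolation, but with these numbers it's still only 4% of all loads going past L2.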
This is pretty normal for a workload with good locality over a small working set, but the accesses that miss in L1d have poor spatio-temporal locality and tend to miss in outer caches as well.
L1d filters out all the "easy" very-high-locality accesses, leaving L2 and L3 to only deal with the "harder" accesses. You can say that L1d exists to give excellent latency (and bandwidth) for the smallest hottest working set, while L2 tries to catch stuff that falls through the cracks. Then L3 only sees the "most difficult" parts of your access pattern.
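If you want to see this effect in isolation, a toy loop like the following (my own sketch, not the code from the question) should show it under `perf stat -d`: the hot buffer keeps the overall L1d miss rate low, while the loads that do miss are random over a buffer much bigger than L3 and so tend to miss at every level:

```c
#include <stdint.h>
#include <stdlib.h>

int main(void) {
    enum { HOT = 4096, COLD = 64 * 1024 * 1024 };   /* 4 KiB vs. 64 MiB */
    uint8_t *hot  = calloc(HOT, 1);
    uint8_t *cold = calloc(COLD, 1);
    uint64_t sum = 0, rng = 1;

    for (long i = 0; i < 200000000; i++) {
        sum += hot[i & (HOT - 1)];            /* almost always an L1d hit */
        if ((i & 15) == 0) {                  /* 1 in 16 loads is "hard": */
            rng = rng * 6364136223846793005ULL + 1;   /* cheap LCG        */
            sum += cold[(rng >> 32) % COLD];  /* random, misses L1/L2/L3  */
        }
    }
    free(hot); free(cold);
    return (int)(sum & 0xff);    /* keep the loads from optimizing away */
}
```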
Also, if you're on an Intel CPU, note that perf doesn't just use `mem_load_retired.l1_miss` events and so on; it tries to count multiple misses to the same line of L1d as a single miss, by using the `L1D.REPLACEMENT` event. `LLC-loads` and `LLC-load-misses` use `OFFCORE_RESPONSE` events, not `mem_load_retired.l3_hit` / `mem_load_retired.l3_miss`. See *How does Linux perf calculate the cache-references and cache-misses events*.
(Two loads to the same cache line that isn't ready yet will share the same LFB to track the incoming line, so this accounting makes sense, and it's also what we want if we care about lines touched / missed rather than individual loads. But `L1-dcache-loads` uses `MEM_INST_RETIRED.ALL_LOADS`, which does count every load. So even the perf-reported L1 hit rate isn't really the per-instruction L1d load hit rate; it will be higher for any program with spatial locality in its L1d misses.)
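To put a rough number on that last point, here's a minimal sketch (the buffer size, iteration count, and LCG are arbitrary choices of mine, and the raw event list in the comment is Intel-specific): each randomly chosen cold line is loaded from twice back to back, so you'd expect roughly one `L1D.REPLACEMENT` per iteration but two loads per iteration whose data didn't come from L1d (the second of each pair hits the in-flight LFB, showing up as `mem_load_retired.fb_hit` rather than `.l1_miss`). The hit rate perf computes from `L1D.REPLACEMENT` therefore looks about twice as good as the per-load reality:

```c
/* Try something like:
 *   perf stat -e l1d.replacement,mem_load_retired.l1_miss,mem_load_retired.fb_hit,mem_inst_retired.all_loads ./a.out
 */
#include <stdint.h>
#include <stdlib.h>

int main(void) {
    enum { N = 16 * 1024 * 1024 };      /* 128 MiB of uint64_t: far beyond L3 */
    uint64_t *buf = calloc(N, sizeof *buf);
    uint64_t sum = 0, rng = 1;

    for (long i = 0; i < 50000000; i++) {
        rng = rng * 6364136223846793005ULL + 1;     /* LCG defeats prefetch      */
        size_t line = ((rng >> 33) % (N / 8)) * 8;  /* pick a random 64-byte line */
        sum += buf[line];               /* misses L1d, allocates an LFB           */
        sum += buf[line + 1];           /* same line: shares that LFB (fb_hit)    */
    }
    free(buf);
    return (int)(sum & 0xff);
}
```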