I'm using `perf_event_open` to read L1/L2/L3 cache counters via the raw performance events defined here. I am seeing surprising behavior:
- L1D misses < L2D accesses
- L2D misses < L3D accesses
- L1D misses < L2D misses
- L2D misses < L3D misses
Based on my understanding of caching, I would expect the opposite inequality at each level: L1 misses > L2 accesses, since an L2 access should only happen on an L1 miss. There are also L1I access/miss counters, but even L1D misses + L1I misses < L2D accesses. Could L1D replacements account for this difference? That still would not explain why L2D cache misses < L3D cache accesses. Is there some hardware caching mechanism that would cause this behavior?
The ultimate goal is to estimate DDR-to-CPU bandwidth by capturing the global L3 miss count, but the first step is understanding the counting behavior at each cache level.