
Below is a block of code that perf record flags as responsible for 10% of all L1-dcache load misses, even though the block consists entirely of movement between zmm registers. This is the perf command:

perf record -e L1-dcache-load-misses -c 10000 -a -- ./Program_to_Test.exe

The code block:

Round:
vmulpd     zmm1, zmm0, zmm28
vcvttpd2qq zmm0{k7}, zmm1
vcvtuqq2pd zmm2{k7}, zmm0
vsubpd     zmm3, zmm1, zmm2
vmulpd     zmm4, zmm3, zmm27
vcvttpd2qq zmm5{k7}{z}, zmm4

vpcmpgtq   k2, zmm5, zmm26
vpcmpeqq   k3{k7}, zmm5, zmm26
kaddq      k1, k2, k3

vcvtqq2pd  zmm2{k7}, zmm0
vdivpd     zmm1{k7}, zmm2, zmm28   ; divide by 100
vpxorq     zmm2{k7}, zmm2, zmm2
vmovupd    zmm2, zmm1
vaddpd     zmm2{k1}, zmm1, zmm25

I get similar results for that code block with other L1 events such as l1d.replacement.

My question is, how can a block that is only zmm register movement generate L1 cache misses? I didn't think registers go to memory at all. In fact, the last memory access is 10 instructions above this block of code; the 9 instructions in between are all register-to-register instructions.

RTC222
  • Whatever HW event perf uses, it's presumably not a "precise" event. You might want to look at `mem_load_retired.l1_miss` to attribute L1 misses to specific load uops. – Peter Cordes Aug 04 '20 at 16:57
  • Also, you can't use `1./100` as a reciprocal? It's not exactly representable as a double, but div is *much* slower than multiply. And maybe I'm missing something, but `vmovupd zmm2, zmm1` overwrites the merge-masked result of the preceding `vpxorq`-zeroing. If that's supposed to zero some elements, can you simply use zero-masking instead, or a blend? – Peter Cordes Aug 04 '20 at 17:01
  • Thanks for the comment re using the reciprocal. I noticed when I posted this that I still have a div instruction. Also, examining this code again the vpxorq instruction looks unnecessary. I'll test it and see. – RTC222 Aug 04 '20 at 17:06
  • [How does Linux perf calculate the cache-references and cache-misses events](https://stackoverflow.com/q/55035313) shows what HW event `perf` actually uses for `L1-dcache-load-misses` - `L1D.REPLACEMENT`! So that counts multiple misses to the same line as only 1 miss, but it's not synchronous with instructions (e.g. HW prefetch can probably cause it). [Can perf account for all cache misses?](https://stackoverflow.com/q/29881885) is related. – Peter Cordes Aug 04 '20 at 17:09
  • I suspected hardware prefetch because on the next iteration we will read 64 bytes from memory again. The L1 cache misses may be delayed from above. As you mentioned, the counters are not 100% precise. – RTC222 Aug 04 '20 at 17:11

1 Answer


The event L1-dcache-load-misses is mapped to L1D.REPLACEMENT on Sandy Bridge and later microarchitectures (and to a similar event on older microarchitectures). This event doesn't support precise sampling, which means that a sample can point to an instruction that couldn't have generated the event being sampled. (Note that L1-dcache-load-misses is not supported on any current Atom.)
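As a quick cross-check (a sketch assuming a Sandy Bridge through Skylake core, where L1D.REPLACEMENT is encoded as event 0x51, umask 0x01; verify the encoding against your CPU's event list), sampling the raw encoding should produce the same counts as the symbolic name:

perf record -e cpu/event=0x51,umask=0x01/ -c 10000 -a -- ./Program_to_Test.exe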

Starting with Linux 3.11 running on a Haswell+ or Silvermont+ microarchitecture, samples can be captured with eventing instruction pointers by specifying a sampling event that meets the following two conditions:

  • The event supports precise sampling. You can use, for example, any of the events that represent memory uop or instruction retirement. The exact names and meanings of the events depend on the microarchitecture; refer to the Intel SDM Volume 3 for more information. There is no event that supports precise sampling and has the exact same meaning as L1D.REPLACEMENT. On processors that support Extended PEBS, only a subset of PEBS events support precise sampling.
  • The precise sampling level is enabled on the event. In Linux perf, this can be done by appending ":pp" to the event name or raw event encoding, or "pp" after the terminating slash of a raw event specified in the PMU syntax. For example, on Haswell, the event mem_load_uops_retired.l1_miss:pp can be specified to Linux perf (see the sketch after this list).
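Putting the two conditions together, a precise-sampling version of the original command might look like the following sketch. The event name assumes Haswell (on Skylake and later the equivalent event is mem_load_retired.l1_miss), and the raw encoding in the second command (event 0xd1, umask 0x08) is an assumption to verify against your CPU's event list:

perf record -e mem_load_uops_retired.l1_miss:pp -c 10000 -a -- ./Program_to_Test.exe
perf record -e cpu/event=0xd1,umask=0x08/pp -c 10000 -a -- ./Program_to_Test.exe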

With such an event, when the event counter overflows, the PEBS hardware is armed, meaning that it's now looking for the earliest possible opportunity to collect a precise sample. When at least one instruction causes an event during this window of time (which opens when PEBS is armed and closes when PEBS is triggered), the PEBS hardware will eventually be triggered by one of these instructions, with a bias toward high-latency instructions. When the instruction that triggers PEBS retires, the PEBS microcode routine executes and captures a PEBS record, which contains among other things the IP of the instruction that triggered PEBS (the eventing IP, as opposed to the architectural IP, which points to the next instruction to execute). The instruction pointer (IP) used by perf to display the results is this eventing IP. (I noticed there can be a negligible number of samples pointing to instructions that couldn't have caused the event.)

On older microarchitectures (before Haswell and Silvermont), the "pp" precise sampling level is also supported, but PEBS on these processors only captures the architectural IP, which points to the static instruction that immediately follows the PEBS-triggering instruction in program order. Linux perf uses the LBR, if possible, which contains source-target IP pairs, to determine whether that captured IP is the target of a jump; if it is, it adds the source IP to the sample record as the eventing IP.

Some microarchitectures support one or more events with a better sampling distribution (how much better depends on the microarchitecture, the event, the counter, and the instructions being executed at the time the counter is about to overflow). In Linux perf, precise distribution can be enabled, if supported, by specifying the precise level "ppp", as in the sketch below.
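For example, cycle sampling is commonly run at this level (a sketch; whether the third "p" is accepted depends on the processor and kernel version, and perf reports an error if the requested level is unsupported):

perf record -e cycles:ppp -a -- ./Program_to_Test.exe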

Hadi Brais
  • So `perf` doesn't default to using PEBS, even if it's available for an event? I thought PEBS was generally better and more efficient (at least for record, if not stat) because it could record samples into a buffer without interrupting execution. (Although being the default could be separate from being better). – Peter Cordes Aug 06 '20 at 03:23
  • @PeterCordes - usually the buffer just has size 1, so it's not more efficient, unless specific conditions are met for "large PEBS". AFAIK without `:p[p...]` suffix you don't get precise events (although some things like `perf record` might default to a precise event when the event is not explicitly specified). – BeeOnRope Aug 06 '20 at 04:52
  • @Hadi - your last paragraph is not clear to me. There is some instruction that causes the counter to overflow (this is necessarily an instruction that causes the event in question), there is the instruction associated with the hardware PEBS sample, there is the instruction associated with the non-hardware perf sample (e.g., when the stack trace is captured), and maybe more, and maybe some of these are the same in some or all scenarios. It's not clear which of these you are talking about at which point of that paragraph. – BeeOnRope Aug 06 '20 at 04:57
  • @BeeOnRope I hope it's better now. I wrote that just before going to sleep :) – Hadi Brais Aug 06 '20 at 18:21
  • @HadiBrais - thanks, I think it is a bit clearer. I'd say I don't fully understand it, but maybe that's just me. One thing I didn't understand was "during this window of time" - which window of time (what are start/end points of this window)? I also didn't understand "which is different from the architectural IP" what is the architectural IP in this sense? – BeeOnRope Aug 22 '20 at 21:49