Below is a block of code that perf record flags as responsible for 10% of all L1-dcache load misses, even though the block consists entirely of zmm register-to-register operations. This is the perf command string:
perf record -e L1-dcache-load-misses -c 10000 -a -- ./Program_to_Test.exe
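(For context on the flags: -e selects the event to sample, -c 10000 takes one sample per 10,000 events, and -a records system-wide across all CPUs.)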
The code block:
Round:
vmulpd zmm1,zmm0,zmm28          ; scale by 100
vcvttpd2qq zmm0{k7},zmm1        ; truncate doubles to signed qwords
vcvtuqq2pd zmm2{k7},zmm0        ; unsigned qwords back to doubles
vsubpd zmm3,zmm1,zmm2           ; fractional part
vmulpd zmm4,zmm3,zmm27          ; scale the fraction
vcvttpd2qq zmm5{k7}{z},zmm4     ; truncate, zeroing masked-off lanes
vpcmpgtq k2,zmm5,zmm26          ; k2 = lanes where zmm5 > zmm26 (signed)
vpcmpeqq k3{k7},zmm5,zmm26      ; k3 = lanes where zmm5 == zmm26
kaddq k1,k2,k3                  ; combine the two compare masks
vcvtqq2pd zmm2{k7},zmm0         ; signed qwords back to doubles
vdivpd zmm1{k7},zmm2,zmm28      ; divide by 100
vpxorq zmm2{k7},zmm2,zmm2       ; zero zmm2
vmovupd zmm2,zmm1               ; copy zmm1 into zmm2
vaddpd zmm2{k1},zmm1,zmm25      ; add zmm25 in lanes selected by k1
I get similar attribution for that code block with other L1 events, such as l1d.replacement.
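That run used the same invocation with the event swapped out; I'm reconstructing the command string from memory, so treat the exact flags as approximate:
perf record -e l1d.replacement -c 10000 -a -- ./Program_to_Test.exe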
My question is: how can a block that is only zmm register movement generate L1 cache misses? I didn't think register-to-register instructions touched memory at all. In fact, the last memory access is 10 instructions above this block of code, and the nine instructions in between are all register-to-register as well.