I'm measuring the execution time of C/C++ code with intrinsics on an Intel Core CPU (Rocket Lake) and observing unexpected shifts in the measured values.
Two functions, f_gpr() (GPR-only instructions) and f_avx512() (AVX-512 instructions), run sequentially and are measured with the core clock cycles PMC counter. The thread is bound to a given physical CPU core, and all data and code fit in the L1 cache. The functions' code is also plain straight-line arithmetic: there are no branches.
Procedure:
10: warmup (execute a _mm512_or_si512() instruction and wait 56000 cycles in a dummy loop so the ZMM registers are fully powered on)
20: serialize (CPUID instruction)
30: read the core clock cycles PMC counter
40: call the measured function f_gpr()
50: serialize (CPUID instruction)
60: read the core clock cycles PMC counter
70: compute the core clock cycles difference
80: execute steps 20-70 ten times and take the minimum value
90: execute steps 20-80 for function f_avx512()
91: execute steps 20-80 for function f_gpr() again
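For reference, steps 20-80 can be sketched roughly as below. This is only an illustration, not my exact code: the names measure_min, serialize_cpuid, read_core_cycles and the fake_* helpers are made up for this sketch. The x86-64 hooks assume GCC/Clang inline asm and that the OS allows user-mode RDPMC (otherwise RDPMC faults). The fake_* stand-ins exist only so the min-of-N logic can be checked without PMC access.

```c
#include <assert.h>
#include <stdint.h>

typedef void (*void_fn)(void);
typedef uint64_t (*counter_fn)(void);

/* Steps 20-80: repeat the serialized measurement `reps` times, keep the minimum. */
static uint64_t measure_min(void_fn serialize, counter_fn read_cycles,
                            void_fn fn, int reps)
{
    assert(reps > 0);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < reps; i++) {
        serialize();                     /* step 20 */
        uint64_t start = read_cycles();  /* step 30 */
        fn();                            /* step 40 */
        serialize();                     /* step 50 */
        uint64_t stop = read_cycles();   /* step 60 */
        uint64_t delta = stop - start;   /* step 70 */
        if (delta < best)                /* step 80 */
            best = delta;
    }
    return best;
}

#if defined(__x86_64__) && defined(__GNUC__)
/* Assumed hook: CPUID (leaf 0) used as a serializing instruction. */
static inline void serialize_cpuid(void)
{
    uint32_t eax = 0, ebx, ecx = 0, edx;
    __asm__ volatile("cpuid"
                     : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                     :
                     : "memory");
}

/* Assumed hook: RDPMC with ECX = (1 << 30) | 1 reads fixed-function
 * counter 1 (unhalted core cycles). Faults if user-mode RDPMC is disabled. */
static inline uint64_t read_core_cycles(void)
{
    uint32_t lo, hi;
    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"((1u << 30) | 1u));
    return ((uint64_t)hi << 32) | lo;
}
#endif

/* Hypothetical deterministic stand-ins: the "work" costs 100, 80, 60 and
 * then 40 fake cycles per call; each serialize adds 5 fake cycles. */
static uint64_t fake_cycles;
static uint64_t fake_cost = 100;
static inline void fake_serialize(void) { fake_cycles += 5; }
static inline uint64_t fake_read(void) { return fake_cycles; }
static inline void fake_work(void)
{
    fake_cycles += fake_cost;
    if (fake_cost > 40)
        fake_cost -= 20;
}
```

On real hardware the call would look like measure_min(serialize_cpuid, read_core_cycles, f_gpr, 10). Note that each delta also includes the second CPUID, so the result carries a constant overhead on top of the function itself.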
In this sequence f_gpr() is measured twice and f_avx512() once (in the middle). At step 91 I consistently observe smaller values than at step 80.
When I use exactly the same procedure but without steps 90 and 91 (i.e. measuring only f_gpr()) and DON'T execute AVX-512 instructions in the warmup (step 10), the measured value matches the one from step 91.
It looks like the AVX-512 code somehow interferes with the GPR code that follows, although the serialization should prevent that. Modern CPUs have complex power-saving logic, so I suspect it plays a role even when using the core clock cycles PMC counter (for example, the upper lanes of the vector registers might be powered down in a low-power state, so that a 512-bit instruction executes on 128-bit lanes and takes more cycles).
Note: According to Agner Fog's docs, it's sufficient to execute a dummy 256/512-bit instruction and wait 56000 cycles to get the 256/512-bit units running at full power.
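The waiting part of the warmup can be sketched as a busy loop on a cycle counter. This is an illustration only: spin_for_cycles and fake_read are hypothetical names, the counter is passed in as a function pointer, and the dummy 256/512-bit instruction is assumed to have been issued by the caller just before the wait.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t (*counter_fn)(void);

/* Busy-wait until at least `cycles` ticks of the supplied counter have
 * elapsed since entry; returns the number of ticks actually waited. */
static uint64_t spin_for_cycles(counter_fn read_cycles, uint64_t cycles)
{
    assert(read_cycles);
    uint64_t start = read_cycles();
    uint64_t elapsed = 0;
    while (elapsed < cycles)
        elapsed = read_cycles() - start;  /* dummy loop: just re-read the counter */
    return elapsed;
}

/* Hypothetical stand-in counter that advances 1000 ticks per read, so the
 * loop can be exercised without access to a hardware counter. */
static uint64_t fake_ticks;
static inline uint64_t fake_read(void)
{
    fake_ticks += 1000;
    return fake_ticks;
}
```

With a real core-cycle counter in place of fake_read, spin_for_cycles(counter, 56000) would correspond to the wait in step 10.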
The workaround seems to be to use in the warmup code only the instruction set that the measured code later uses (e.g. no AVX-512 in the warmup when measuring GPR or AVX2 code).
But I'm interested in the reason for this behavior. Thanks