
I'm measuring execution of C/C++/intrinsics code on an Intel Core CPU (Rocket Lake) and observing non-obvious shifts in the measured values.

Two functions, f_gpr() (GPR-only instructions) and f_avx512() (AVX-512 instructions), run sequentially and are measured with the core-clock-cycles PMC counter. The thread is bound to a given physical CPU core, and all data and code fit in the L1 cache. The functions' code is also quite plain (arithmetic instructions): there are no branches.

Procedure:

10: warmup (execute a _mm512_or_si512() instruction and wait 56000 cycles in a dummy loop to fully power on the ZMM units)
20: serialize (CPUID instruction)
30: read core clock cycles PMC counter
40: call measuring function f_gpr()
50: serialize (CPUID instruction)
60: read core clock cycles PMC counter
70: find core clock cycles difference
80: execute steps 20-70 10 times and take the minimal value
90: execute steps 20-80 for function f_avx512()
91: execute steps 20-80 for function f_gpr() again

In this sequence f_gpr() is measured twice and f_avx512() once (in the middle). At step 91 I consistently observe smaller values than at step 80.

When I use exactly the same procedure but without steps 90 and 91 (i.e. only measure f_gpr()) and DON'T apply AVX-512 instructions in the warmup (step 10), the observed value corresponds to the one from step 91.

It looks like the AVX-512 code somehow interferes with the subsequent GPR code, even though the serialization should prevent that. Modern CPUs have complex power-saving logic, so I think it plays a role even when using the core-clock-cycles PMC counter (for example, the high lanes of the vector registers might be turned off in a low-power state, so that e.g. a 512-bit instruction executes on 128-bit lanes and takes more cycles).

Note: per Agner Fog's docs, it is sufficient to execute a dummy 256/512-bit instruction and wait 56000 cycles to get the 256/512-bit CPU units running at full power.

It looks like a workaround is to use in the warmup code only the instruction set that is also used in the measured code (e.g. not to use AVX-512 when measuring GPR or AVX2 code).

But I'm interested in the reason for this behavior. Thanks.

  • Note that it's not about "powering up" the full width of SIMD units, it's about *lowering* the clock frequency of the whole core (or possibly just raising the voltage to give more margin for the higher current draw from wider SIMD). [SIMD instructions lowering CPU frequency](https://stackoverflow.com/q/56852812) . Changing voltage requires a gap during which the core clock is halted while voltage settles - [Lost Cycles on Intel? An inconsistency between rdtsc and CPU\_CLK\_UNHALTED.REF\_TSC](https://stackoverflow.com/q/45472147) – Peter Cordes Feb 08 '23 at 16:59
  • Are your timed intervals long enough for the CPU to switch frequency again, back up to max L0 turbo during the non-AVX512 parts? If so, you could be getting a frequency transition every time. That pause to stabilize a new voltage+frequency might actually be what you're observing, during some GPR code if AVX-512 instructions are still in-flight in the scheduler to trigger a frequency transition. Or maybe it's something else, I don't have a clear picture of the time-scales (or uop counts) involved in what you're describing. – Peter Cordes Feb 08 '23 at 17:13
  • Thanks for the discussion. I'm specifically using "core clock cycles" (CCC) to be free from the effects of CPU frequency scaling. CCC is the number of clock cycles of a particular physical core, regardless of what frequency those cycles ran at and how the frequency changed. CCC has nothing in common with the TSC (time stamp counter), which depends on CPU frequency, possibly with a constant scale factor. E.g. an overheated and consequently thermally throttled CPU measuring the same code would have the same CCC as a non-throttled one, but their TSCs would differ in proportion to the throttle factor. – Akon Feb 08 '23 at 17:36
  • 1
    Measuring intervals - microseconds or tens of microseconds on 4 GHz. – Akon Feb 08 '23 at 17:41
  • Oh right, the `cpu_clk_unhalted.thread` event wouldn't tick while the CPU is settling on a new frequency/voltage. Derp. The out-of-order machinery should just pause, not having to drain the scheduler or anything, so IPC should be the same before/after (except that 512-bit uops can run full throughput at the new speed if any were in flight.) You're correct that measuring core cycles means the same code will take the same number of cycles at any frequency, if it doesn't have to go off-core for a cache miss or use 512 or 256-bit uops that get throttled at the current frequency. – Peter Cordes Feb 08 '23 at 17:57
  • *and code fit L1 cache.* - But what about the uop cache? Did you unroll into large enough blocks that the first of your 10 repeats has to run from legacy decode instead of the uop cache? What speed ratios are you seeing on your Rocket Lake CPU? Have you tried profiling the whole thing with `perf stat` to see how many uops come from the DSB (uop cache) vs. MITE (legacy decode)? – Peter Cordes Feb 08 '23 at 22:01

0 Answers