Intel Alder Lake performance degradation after spin-wait

Question

I'm tuning my program for low latency.

I have a tight calculation function, calc(), which makes heavy use of SIMD floating-point instructions.

I measured the performance of calc() with the perf command. It shows that calc() takes ~10k instructions and ~5k CPU cycles on average.

However, when I call calc() after a spin-wait like this:

while(true) {
  if (!flag.load(std::memory_order_acquire)) {
      continue;
  }

  calc();
}

the calc part takes about 10k cycles, while other perf counters such as l1d-cache-misses, llc-misses, branch-misses, and instructions remain the same.

Can anyone help explain why this happens and what I should do to avoid it? I want to keep calc() as fast as possible.

Also, I have 2 interesting findings:

If the flag variable gets set within a very short period (less than 1 ms), I don't notice any performance degradation in calc().
If I add some garbage SIMD floating-point calculation in the middle of the spin-wait, I get the expected performance.
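
For illustration, the second finding could look roughly like the sketch below. The names `keep_fp_warm` and `spin_wait_warm` are made up for this example and are not from my real code. It assumes GCC or Clang (for the empty `asm volatile` barriers) and only uses a 256-bit multiply when compiled with AVX enabled (e.g. `-mavx2`). Whether a single multiply per iteration is enough to keep calc() fast is an assumption that has to be measured.

```cpp
#include <atomic>
#include <cassert>

#if defined(__AVX__)
#include <immintrin.h>
// Hypothetical helper: issue one throwaway 256-bit FP multiply.
inline void keep_fp_warm() {
    __m256 v = _mm256_set1_ps(1.0f);
    asm volatile("" : "+x"(v));   // hide the value from the optimizer
    v = _mm256_mul_ps(v, v);      // the "garbage" SIMD floating-point work
    asm volatile("" : "+x"(v));   // keep the multiply from being removed
}
#else
// Scalar fallback when the build does not enable AVX.
inline void keep_fp_warm() {
    volatile float v = 1.0f;
    v = v * 1.0f;
}
#endif

// Hypothetical helper: spin until `flag` is set, doing a little FP work
// on every iteration. Returns how many iterations were spent spinning.
inline unsigned long spin_wait_warm(const std::atomic<bool>& flag) {
    unsigned long spins = 0;
    while (!flag.load(std::memory_order_acquire)) {
        keep_fp_warm();
        ++spins;
    }
    return spins;
}
```

The consumer loop then becomes `spin_wait_warm(flag); calc();` instead of the bare `continue` loop above.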

My CPU is a 13900K. I also tested on a 12900K and on Ice Lake CPUs such as the Xeon 8368. They appear to show the same behaviour.

I noticed in the Optimization Reference Manual that there is something called Thread Director, which can automatically detect thread classes at runtime, and that there is a special class called Pause (spin-wait) dominated code. I don't know whether this is related, but it looks as if, after some time, the CPU detects that the thread is in a spin-wait loop and then reduces the resources allocated to this thread?

Update

I'm testing on a Red Hat real-time kernel. I disabled the efficiency cores in the BIOS, set the CPU affinity to a specific core ID, and set the scheduling policy to FIFO with priority 99. I have also blocked as many interrupts as I can and reduced local timer interrupts to one per second.

I also tried adding _mm_pause() in the middle of the spin loop (as suggested by the Optimization Reference Manual), but it did not help.

I bought the 13900K server from a specialist vendor. It uses a liquid cooling system, and all 8 performance cores are overclocked to 5.8 GHz. The boot command line of the system is:

# cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt2)/vmlinuz-4.18.0-425.13.1.rt7.223.el8_7.x86_64 root=/dev/mapper/rhel-root ro crashkernel=auto rd.lvm.lv=rhel/root rhgb quiet isolcpus=0-6 rcu_nocbs=0-6 spectre_v2=off mitigations=off iommu=off intel_iommu=off tsc=reliable pcie_port_pm=off ipv6.disable=1 ipmi_si.force_kipmid=0 acpi_irq_nobalance rcu_nocb_poll clocksource=tsc selinux=0 intel_pstate=disable pcie_aspm=performance nosoftlockup audit=0 nmi_watchdog=0 mce=ignore_ce nohz=on intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll transparent_hugepage=never hpet=disabled noht nohz_full=0-6 skew_tick=1

Alder Lake is a big-little processor, so your process can move from energy-efficient cores to high-performance cores and vice versa. You need to check which core your process is scheduled on. `perf sched` (https://www.man7.org/linux/man-pages/man1/perf-sched.1.html) can help you do that. You need to take frequency scaling and turbo mode into account too: can you stabilize the frequency (see: https://stackoverflow.com/questions/75512363/#75516229) just to be sure it is not that? — Jérôme Richard, Feb 25 '23 at 16:04
Your problem seems very close to the question https://stackoverflow.com/questions/75548704/intel-mkl-multi-threaded-matrix-vector-multiplication-sgemv-slow-after-little (though that one is on an AMD CPU). It might be due to the same problem. — Jérôme Richard, Feb 25 '23 at 16:07
@JérômeRichard I'm testing on a Red Hat real-time kernel. I disabled the efficiency cores in the BIOS, set the CPU affinity to a specific core ID, and set the scheduling policy to FIFO with priority 99. I have also blocked as many interrupts as I can and reduced local timer interrupts to one per second. — VariantF, Feb 25 '23 at 16:21
@JérômeRichard I've looked at https://stackoverflow.com/questions/75548704/intel-mkl-multi-threaded-matrix-vector-multiplication-sgemv-slow-after-little. It's very similar. I also tried adding _mm_pause() in the middle of the spin loop (as suggested by the Optimization Reference Manual), but it did not help. — VariantF, Feb 25 '23 at 16:26
Also, I bought the 13900K server from a specialist vendor. It uses a liquid cooling system, and all 8 performance cores are overclocked to 5.8 GHz. The boot command line of the system is: — VariantF, Feb 25 '23 at 16:35
``` BOOT_IMAGE=(hd0,gpt2)/vmlinuz-4.18.0-425.13.1.rt7.223.el8_7.x86_64 root=/dev/mapper/rhel-root ro crashkernel=auto rd.lvm.lv=rhel/root rhgb quiet isolcpus=0-6 rcu_nocbs=0-6 spectre_v2=off mitigations=off iommu=off intel_iommu=off tsc=reliable pcie_port_pm=off ipv6.disable=1 ipmi_si.force_kipmid=0 acpi_irq_nobalance rcu_nocb_poll clocksource=tsc selinux=0 intel_pstate=disable pcie_aspm=performance nosoftlockup audit=0 nmi_watchdog=0 mce=ignore_ce nohz=on intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll transparent_hugepage=never hpet=disabled noht nohz_full=0-6 skew_tick=1``` — VariantF, Feb 25 '23 at 16:36
With hardware P-state management disabled, that rules out a lot of possible explanations like CPU frequency, although turbo clocks (above the highest P-state) are always hardware managed. The clock doesn't tick during the time it takes for the CPU to settle at a new frequency / voltage. — Peter Cordes, Feb 25 '23 at 18:17
But I suspect your benchmark is short enough that it's affected by [SIMD instructions lowering CPU frequency](https://stackoverflow.com/q/56852812), with 256-bit math running at reduced throughput until the core downclocks from max L0 to the L1 "license" speed. With an FPU instruction in the spin loop, it keeps the frequency at the L1 max, so the CPU's ready to execute FP math at full per-clock throughput as soon as it leaves the spin loop. — Peter Cordes, Feb 25 '23 at 18:18
@PeterCordes AFAIK, the impact of AVX/AVX-512 on the frequency of recent processors should be small. https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html (I guess from BeeOnRope too :p) tends to go in this direction. So far I cannot find information about this for Alder Lake (nor for AMD). Turbo can be disabled, but it is not clear whether the OP did so. Anyway, the frequency can be seen in `perf` statistics, so such an effect can be easily observed. — Jérôme Richard, Feb 25 '23 at 20:41
@JérômeRichard: Yes, the frequency reduction is small, but if the L0 frequency isn't the same as the L1 frequency, "heavy" 256-bit instructions will be throttled (to less than 1 per clock cycle) while the current CPU clock speed is above the L1 frequency. That's what can explain worse performance when measuring in core clock cycles, not RDTSC reference cycles / wall-clock time. If the OP was using RDTSC, then the pause while the CPU changes speed could perhaps also explain it, if it can settle at the new frequency in only 5k cycles. — Peter Cordes, Feb 25 '23 at 22:16
@JérômeRichard: note how brief this timed interval is; it's short enough to be affected by the transient behaviour, not just the new clock speed the CPU picks. — Peter Cordes, Feb 25 '23 at 22:18
@PeterCordes I also tried benchmarking it with clock_gettime(), which I think is an RDTSC-based timer. It also gives me a longer time when I put calc() after the spin-wait (750 ns -> 1500 ns). — VariantF, Feb 26 '23 at 05:25
Yeah, that's consistent with having to change frequency, pausing the core clock (but not the TSC) for some time while things settle. [Lost Cycles on Intel? An inconsistency between rdtsc and CPU\_CLK\_UNHALTED.REF\_TSC](https://stackoverflow.com/q/45472147). So probably there was some time *before* the CPU decided to change frequency, during which you got reduced per-clock throughput for 256-bit instructions. — Peter Cordes, Feb 26 '23 at 05:29
If my understanding is correct, setting a lower base frequency should fix the L0-L1 frequency switch and thus the effect although it may decrease the performance of scalar codes, right? If so, this could be a good test to perform, just to be sure. — Jérôme Richard, Feb 26 '23 at 13:33
@PeterCordes Are there any official docs from Intel that talk about the L0-L2 'licenses'? And how can I tell whether my core is running at L0 or L1? — VariantF, Feb 26 '23 at 16:50
@JérômeRichard: Yes, if you lower the *max* frequency so it can't ever turbo above the L1 license, that might avoid throttling and then switching. But you might only be able to control frequency, and AVX might require a higher voltage at the same frequency. Disabling turbo entirely might work as an experiment. — Peter Cordes, Feb 26 '23 at 17:20
@VariantF: I don't know about official docs, I haven't checked recently. BeeOnRope suggested https://en.wikichip.org/wiki/intel . To check for your own CPU, look at the current frequency of a core while it's running 256-bit FMAs vs. scalar code. You might not be able to see the current voltage, so you can't be sure you're in L1, but you can be sure you're in L0 if the current frequency is above the L1 frequency. — Peter Cordes, Feb 26 '23 at 17:24
@PeterCordes How can I get the current CPU frequency in Linux? /proc/cpuinfo always shows cpu MHz: 3001.000, even when my program is running, and turbostat always shows Avg_MHz=5800, Bzy_MHz=5791 and TSC_MHz=2995 no matter which program I'm running on that core. — VariantF, Feb 26 '23 at 17:47
@PeterCordes Does it matter that I have set intel_pstate=disable in the kernel boot parameters? — VariantF, Feb 26 '23 at 17:49
@VariantF Check the last link in the first comment : it contains useful information about this, especially the sys-fs files. — Jérôme Richard, Feb 26 '23 at 19:01
@JérômeRichard Sorry, I don't quite understand this point. Since I've set intel_pstate=disable in the boot options, there's no /sys/devices/system/cpu/intel_pstate folder, and cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq shows 3001000. Does this mean my CPU is running at 3.0 GHz constantly? What does it mean that turbostat shows Avg/Bzy frequency = 5.8 GHz? — VariantF, Feb 27 '23 at 03:54
I also checked the link https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html. It looks like on the latest microarchitectures the frequency doesn't change, but the voltage changes between normal/AVX2/AVX-512 instructions? — VariantF, Feb 27 '23 at 03:58
@VariantF: `intel_pstate=disabled` probably doesn't help; that disables HW management of clock speeds below "base" non-turbo clock speed, i.e. below your 2.7 GHz. But turbo is always hardware-managed because it has to be opportunistic (according to thermal and electrical limits) and fast responding to changes, including in workload (like 256-bit instructions). — Peter Cordes, Feb 27 '23 at 23:01
@VariantF: [Varying CPU frequencies on Intel](https://stackoverflow.com/a/75507297) has some info about the challenges of reading the CPU frequency. I haven't used `turbostat` much; hopefully it's good. Since you're already using `perf` to count the `cycles` event, have it measure the `task-clock` event, too, so it can calculate cycles per second = frequency for you, over some interval your code was running. But yeah, if your Alder Lake is the same as Ice Lake, it might just be a voltage change, not frequency, that's needed to allow full-throughput L1 instructions. — Peter Cordes, Feb 27 '23 at 23:05
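
As a rough illustration of the "cycles per second = frequency" arithmetic suggested in the last comment, here is a minimal sketch. The helper name `effective_ghz` is made up for this example. It assumes the cycle count and the elapsed time cover the same interval, and the inputs below are only the rounded figures quoted in this thread (~5k cycles in 750 ns, ~10k cycles in 1500 ns), so the result is no more precise than those figures.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: average core frequency over an interval.
// Cycles divided by nanoseconds gives GHz directly (1 cycle/ns = 1 GHz).
inline double effective_ghz(double core_cycles, double elapsed_ns) {
    return core_cycles / elapsed_ns;
}

// Rounded figures from the thread, used only as example inputs.
inline double ghz_without_spin() { return effective_ghz(5000.0, 750.0); }
inline double ghz_after_spin()   { return effective_ghz(10000.0, 1500.0); }
```

Both calls divide to the same ratio because the quoted cycle count and the quoted time each roughly doubled. With real measurements, `perf stat -e cycles,task-clock` provides the two inputs for the same calculation.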
