I have a python script (running under Linux fwiw) that I want to speed up by rewriting some bottleneck parts in different ways to see which one is better/faster.

I can measure the runtime or instrument it with cProfile, but the problem is that I am using a modern laptop CPU. The runtime changes by as much as 5% from one run to the next - which is, as far as I can tell, due to various forms of CPU throttling (temperature / power).

This makes it very hard to check actual code performance differences. Is there some way to compensate for this and/or easily measure actual work done by the CPU?
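One cheap mitigation, independent of controlling the CPU itself, is to repeat the measurement many times and keep only the fastest run: the fastest run is the one least disturbed by throttling and background activity. `python3 -m timeit` does this out of the box (the timed expression below is just a stand-in for a real bottleneck):

```shell
# timeit reports the best of -r repeats, which filters out runs
# slowed down by throttling or other processes.
python3 -m timeit -r 7 -n 1000 'sum(i*i for i in range(1000))'
```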

Sec
  • Well, thermal throttling is not the only reason that can explain such a difference (although it plays a significant role). In fact there are *plenty of sources of slowdown*. These include for example context switches (e.g. browsers are good applications to disturb running benchmarks, but even things like ssh connections can have a tiny impact), the allocated memory alignment, the address of the code itself, the distribution of allocated pages in virtual & physical memory, OS scheduling and the behaviour of the OS for its own internal data structures, not to mention I/O. – Jérôme Richard Dec 11 '21 at 16:02
  • [This talk](https://www.youtube.com/watch?v=r-TLSBdHe1A) partially explains this and provides a tool to better deal with such statistical noise. Not all sources of slowdown can be removed. Many processors do not provide a way to set a fixed frequency, and the mainstream operating systems are far from being stable in terms of performance (real-time systems are better for that). HPC systems often manage to reduce the noise to 1-2% by running only a few critical processes, tuning the OS (e.g. fewer kernel calls, thread binding) and the processor configuration (e.g. frequency & memory control). – Jérôme Richard Dec 11 '21 at 16:09

1 Answer

It may be viable to disable turbo or whatever other vendors call their boost clocks, so the clock frequency stays at the baseline frequency the laptop can sustain.
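On Linux, how you disable turbo depends on the frequency driver. The commands below are a sketch for the two common cases (`intel_pstate` on Intel, the generic `cpufreq` boost knob elsewhere); both need root, and `cpupower` may need to be installed separately:

```shell
# Intel (intel_pstate driver): 1 = turbo disabled
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

# Generic cpufreq boost knob (e.g. acpi-cpufreq on AMD): 0 = boost disabled
echo 0 | sudo tee /sys/devices/system/cpu/cpufreq/boost

# Optionally pin the governor so the frequency stays put during the benchmark
sudo cpupower frequency-set -g performance
```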

Running the CPU a lot slower (like 1.6GHz instead of 3 or something) changes the relative cost of a cache miss to DRAM vs. a branch mispredict, because DRAM still takes a similar amount of nanoseconds but that's a lot fewer clock cycles. So it's not perfect if the thing you're comparing involves that kind of tradeoff. Similarly for I/O vs. CPU.

If you can get your system to run at a couple different low but stable frequencies, you can extrapolate performance at higher frequencies even for workloads that are sensitive to memory latency and maybe bandwidth: Dr. Bandwidth explains how in a blog article with slides from his HPC conference talk on it.


For mostly CPU-bound stuff (not memory or I/O), `perf stat ./my_program` can be useful: look at time in core clock cycles instead of seconds. This doesn't even try to control for relative differences in cache miss costs vs. on-core effects, but is convenient if you're on Linux or another OS that has a handy profiler that can use HW performance counters. (Usually only works on bare metal, not in a VM; most VMs don't virtualize the performance-monitoring unit.)
If L3 cache misses are a significant part of the performance cost, you'd expect core clock cycles to vary with frequency, again because of RAM becoming relatively faster / lower latency compared to the CPU core, meaning out-of-order exec can hide more of the latency of a cache miss.
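As a sketch, a `perf` invocation for a Python script might look like this (`my_script.py` is a placeholder for your program; the events listed are the standard generic perf events, and `perf` comes from the `linux-tools` / `perf` package):

```shell
# Count core clock cycles and instructions instead of wall time;
# cycle counts are much less sensitive to frequency changes than
# seconds are, for core-bound code.
perf stat -e task-clock,cycles,instructions,branch-misses python3 my_script.py
```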

See also *Idiomatic way of performance evaluation?* for other benchmark considerations not related to keeping frequency stable.

See *Why can't my ultraportable laptop CPU maintain peak performance in HPC* for a good example of an ultraportable laptop's CPU frequency vs. time when running power-intensive loads, and the CPU-design reasons for it being that way.

Peter Cordes
  • If you can run at different frequencies (even if they are all lower than you want), it is possible to make accurate extrapolations to performance at higher frequencies. See https://sites.utexas.edu/jdm4372/2020/04/02/the-surprising-effectiveness-of-non-overlapping-sensitivity-based-performance-models/. – John D McCalpin Dec 14 '21 at 00:33