
Just like a turbocharged engine has "turbo lag" (the time it takes for the turbo to spool up), I'm curious what the "turbo lag" of Intel processors is.

For instance, the i9-8950HK in my MacBook Pro 15" 2018 (running macOS Catalina 10.15.7) usually sits around 1.3 GHz when idle, but when I run a CPU-intensive program, the CPU frequency shoots up to, say, 4.3 GHz (initially). The question is: how long does it take to go from 1.3 GHz to 4.3 GHz? 1 microsecond? 1 millisecond? 100 milliseconds?

I'm not even sure whether this is up to the hardware or the operating system.

This is in the context of benchmarking some CPU-intensive code which takes a few tens of milliseconds to run. The thing is, right before this piece of CPU-intensive code runs, the CPU is essentially idle (and thus the clock speed will have dropped down to, say, 1.3 GHz). I'm wondering what slice of my benchmark runs at 1.3 GHz and what slice runs at 4.3 GHz: 1%/99%? 10%/90%? 50%/50%? Or even worse?

Depending on the answer, I'm thinking it would make sense to run some CPU-intensive code prior to starting the benchmark as a way to "spool up" Turbo Boost. And this leads to another question: for how long should I run this "spooling-up" code? Probably one second is enough, but what if I'm trying to minimize this time -- what's a safe duration for the "spooling-up" code to run, to make sure the CPU will run the main code at the maximum frequency from the very first instruction executed?
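To make the idea concrete, here is a minimal sketch of the kind of "spooling-up" code I have in mind, assuming a POSIX `clock_gettime()` timer. The `spool_up()` helper and the 100 ms duration are placeholders of my own, not a verified safe value -- the right duration is exactly what I'm asking about:

```c
#include <stdint.h>
#include <time.h>

/* Spin on cheap dependent integer work until `ms` milliseconds have
 * elapsed, giving the CPU time to ramp up to its turbo frequency. */
static void spool_up(long ms) {
    struct timespec start, now;
    clock_gettime(CLOCK_MONOTONIC, &start);
    uint64_t x = 1;
    volatile uint64_t sink;
    do {
        for (int i = 0; i < 100000; i++)   /* keep the core busy */
            x = x * 6364136223846793005ULL + 1442695040888963407ULL;
        sink = x;
        clock_gettime(CLOCK_MONOTONIC, &now);
    } while ((now.tv_sec - start.tv_sec) * 1000L
             + (now.tv_nsec - start.tv_nsec) / 1000000L < ms);
    (void)sink;
}

int main(void) {
    spool_up(100);   /* placeholder duration -- see the question above */
    /* ... start the benchmark timer and run the code under test ... */
    return 0;
}
```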

swineone
  • I think it partly depends on the `energy_performance_preference` (or `..._bias`) setting you (or your OS) sets for the CPU's hardware frequency selection. (Linux Q&A: [What are the implications of setting the CPU governor to "performance"?](https://unix.stackexchange.com/q/439340)). But yes, well under 1 ms on Skylake-derived CPUs like yours which let the CPU choose its own frequency, maybe 10s of microseconds with an aggressive EPP setting. – Peter Cordes Oct 08 '20 at 01:12
  • Note that turbo frequency isn't the only kind of "warm up" needed for some benchmarks: if your benchmark touches an array, touching it first before your timed run is a good idea; the first access will cause page faults. See [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987) for more – Peter Cordes Oct 08 '20 at 01:13

2 Answers


The paper *Evaluation of CPU frequency transition latency* presents transition latencies for various Intel processors. In brief, the latency depends on the state the core is currently in and on the target state. For the evaluated Ivy Bridge processor (i7-3770 @ 3.4 GHz), the latencies varied from 23 µs (1.6 GHz -> 1.7 GHz) to 52 µs (2.0 GHz -> 3.4 GHz).

At the Hot Chips 2020 conference, a major transition-latency improvement in the upcoming Ice Lake processors was presented. This should matter most for partially vectorized code that uses AVX-512 instructions: since these instructions do not support frequencies as high as SSE or AVX2 instructions do, an island of AVX-512 code causes the processor frequency to scale down and then back up.

Pre-heating the processor obviously makes sense, as does "pre-heating" memory. One second of a prior workload is enough to reach the highest available turbo frequency; however, you should also take the processor's temperature into account, since it may scale the frequency down (actually the CPU core and uncore frequencies, if we are speaking about one of the latest Intel processors). You will not reach the temperature limit within a second. But it depends on what you want your benchmark to measure, and whether you want to take the temperature limit into account. Speaking of the temperature limit, be aware that your processor also has a power limit, which is another possible reason for the frequency to scale down during the application run.

Another thing you should take into account when benchmarking your code is that its runtime is very short. Be aware of the reliability of runtime/resource-consumption measurements at that scale. I would suggest artificially extending the runtime (run the code 10 times and measure the overall consumption) for better results, as in the sketch below.
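A minimal sketch of that suggestion, with a hypothetical `run_kernel()` standing in for the code under test:

```c
#include <stdio.h>
#include <time.h>

/* Stub standing in for the real benchmark kernel. */
static void run_kernel(void) { /* ... code under test ... */ }

int main(void) {
    enum { REPS = 10 };
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < REPS; i++)
        run_kernel();   /* repeat to amortize timer overhead and noise */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double total_s = (t1.tv_sec - t0.tv_sec)
                   + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("total: %.3f ms, avg per run: %.3f ms\n",
           total_s * 1e3, total_s * 1e3 / REPS);
    return 0;
}
```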

Andrew
  • That's measuring the latency after something decides to transition, right? That's only part of the reaction time to load, even with hardware P-state management (e.g. from rated-max to turbo on pre-Skylake, or from idle to turbo on Skylake *if* the OS hands off control). Or if the OS chooses to maintain software control of CPU frequency below the turbo range, the transition from idle to the highest (non-turbo) P-state can take tens of milliseconds, e.g. until the next timer interrupt or longer. The actual transition latency is more like the cost of a transition (how long the clock is stopped). – Peter Cordes Oct 15 '20 at 10:17
  • Yes, you are right, the study used the userspace governor. The target P-state is set by whichever scaling governor is in use, so we are no longer talking about the hardware, but about the reaction time of the scaling governor. – Andrew Oct 15 '20 at 12:12

I wrote some code to check this, with the aid of the Intel Power Gadget API. It sleeps for one second (so the CPU goes back to its slowest speed), measures the clock speed, runs some code for a given amount of time T, then measures the clock speed again.

I only tried this on my 2018 15" MacBook Pro (i9-8950HK CPU) running macOS Catalina 10.15.7. The specific CPU-intensive code being run between clock speed measurements may also influence the result (is it integer only? FP? SSE? AVX? AVX-512?), so don't take these as exact numbers, but only order-of-magnitude/ballpark figures. I have no idea how the results translate into different hardware/OS/code combinations.
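In outline, the measurement loop looked roughly like the sketch below. Here `read_clock_ghz()` is a dummy stub standing in for the Intel Power Gadget frequency query (the actual API calls are omitted), and `busy_wait_ms()` is one arbitrary choice of integer-only filler:

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Stub: replace with an Intel Power Gadget frequency query (or any
 * other frequency-sampling facility). Returns a dummy value here. */
static double read_clock_ghz(void) { return 0.0; }

/* Burn CPU time on dependent integer work for `ms` milliseconds. */
static void busy_wait_ms(long ms) {
    struct timespec start, now;
    clock_gettime(CLOCK_MONOTONIC, &start);
    uint64_t x = 1;
    volatile uint64_t sink;
    do {
        for (int i = 0; i < 100000; i++)
            x = x * 6364136223846793005ULL + 1;
        sink = x;
        clock_gettime(CLOCK_MONOTONIC, &now);
    } while ((now.tv_sec - start.tv_sec) * 1000L
             + (now.tv_nsec - start.tv_nsec) / 1000000L < ms);
    (void)sink;
}

int main(void) {
    for (long t_ms = 1; t_ms <= 64; t_ms *= 2) {
        sleep(1);                    /* let the CPU drop back to idle */
        double before = read_clock_ghz();
        busy_wait_ms(t_ms);          /* CPU-intensive work for T ms */
        double after = read_clock_ghz();
        printf("T=%3ld ms: %.1f -> %.1f GHz\n", t_ms, before, after);
    }
    return 0;
}
```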

The minimum clock speed when idle in my configuration is 1.3 GHz. Here are the results I obtained, in tabular form.

+--------+-------------+
| T (ms) | Final clock |
|        | speed (GHz) |
+--------+-------------+
| <1     | 1.3         |
| 1..3   | 2.0         |
| 4..7   | 2.5         |
| 8..10  | 2.9         |
| 10..20 | 3.0         |
| 25     | 3.0-3.1     |
| 35     | 3.3-3.5     |
| 45     | 3.5-3.7     |
| 55     | 4.0-4.2     |
| 66     | 4.6-4.7     |
+--------+-------------+

So 1 ms appears to be the minimum amount of time needed to see any change at all. 10 ms gets the CPU to its nominal frequency; ramping beyond that is slower, apparently taking over 50 ms to reach the maximum turbo frequencies.

swineone
  • I guess it makes sense for a laptop to have a very conservative CPU frequency ramp-up. Was this on battery, or AC power? I don't know whether macOS uses hardware P-states or does software control of CPU frequency (up to the "rated" max non-turbo speed, at which point even on older CPUs it's up to the hardware to decide when to actually turbo). My Arch GNU/Linux i7-6700k desktop jumps to max in well under 1 ms, I'm pretty sure, with hardware P-state control and `energy_performance_preference` = "balance_performance". – Peter Cordes Oct 08 '20 at 01:51
  • It's on AC power. Out of curiosity, how can you tell that your desktop does it in under 1 ms? – swineone Oct 08 '20 at 02:15
  • `perf stat ./a.out` shows the average clock speed for the whole process (using HW performance counters to measure cycles, divided by CPU time), and even very short total times show an average near max turbo. Also, I know from Intel's IDF2015 presentation about Skylake's hardware P-state feature that one of the major points is to react very quickly to bursty workloads (like web-page rendering) to make them snappy, then quickly drop back to idle. And the on-board power-management microcontroller evaluates data and makes decisions on the order of microseconds. – Peter Cordes Oct 08 '20 at 02:22
  • https://en.wikichip.org/wiki/File:Intel_Architecture,_Code_Name_Skylake_Deep_Dive-_A_New_Architecture_to_Manage_Power_Performance_and_Energy_Efficiency.pdf has the slides from that talk, but unfortunately the audio of the presentation seems to have disappeared from the web :/ – Peter Cordes Oct 08 '20 at 02:23
  • [Slowing down CPU Frequency by imposing memory stress](https://stackoverflow.com/q/63399456) has some example `perf stat` outputs for longer runs, showing downclocking on memory-bound workloads. Also not exactly what you're asking about, but [Lost Cycles on Intel? An inconsistency between rdtsc and CPU\_CLK\_UNHALTED.REF\_TSC](https://stackoverflow.com/q/45472147) has some details about perf counters during a frequency-switch between different turbo levels. – Peter Cordes Oct 08 '20 at 02:33
  • Semi-related, re: OS CPU frequency decisions on Linux (on a CPU before skylake, so not hardware P-state management): [Why does this delay-loop start to run faster after several iterations with no sleep?](https://stackoverflow.com/a/38300014) – Peter Cordes Oct 08 '20 at 02:35