1

I am attempting to do some data science with CPU core temperatures. I need to monitor how CPU core temperature changes over time. I am attempting to use two tools to do this:

  1. lm-sensors for measuring core and package temperature
  2. stress for generating a load

The problem I am seeing is that as soon as stress starts the temperature skyrockets, and as soon as it stops it plummets. This can't be right!

Here is a little shell script and output to demonstrate the problem:

Script:

sensors | grep Core
stress -c 8 -t 1
sensors | grep Core
str=$'Sleeping for 1s \n' 
read -t 1 -p "$str"
sensors | grep Core

Output:

Core 0:        +49.0°C  (high = +100.0°C, crit = +100.0°C)
Core 1:        +51.0°C  (high = +100.0°C, crit = +100.0°C)
Core 2:        +49.0°C  (high = +100.0°C, crit = +100.0°C)
Core 3:        +47.0°C  (high = +100.0°C, crit = +100.0°C)
stress: info: [6956] dispatching hogs: 8 cpu, 0 io, 0 vm, 0 hdd
stress: info: [6956] successful run completed in 1s
Core 0:        +81.0°C  (high = +100.0°C, crit = +100.0°C)
Core 1:        +73.0°C  (high = +100.0°C, crit = +100.0°C)
Core 2:        +73.0°C  (high = +100.0°C, crit = +100.0°C)
Core 3:        +68.0°C  (high = +100.0°C, crit = +100.0°C)
Sleeping for 1s 
Core 0:        +51.0°C  (high = +100.0°C, crit = +100.0°C)
Core 1:        +53.0°C  (high = +100.0°C, crit = +100.0°C)
Core 2:        +51.0°C  (high = +100.0°C, crit = +100.0°C)
Core 3:        +48.0°C  (high = +100.0°C, crit = +100.0°C)
       +51.0°C  (high = +100.0°C, crit = +100.0°C)

Is this expected behavior? Is it physically possible for the temperature sensors to see that much change this quickly? If so, I'm in trouble in terms of characterizing temperature changes. There is no time for me to gather data. The temperature basically spikes instantaneously, doesn't change while the jobs are running, and then vanishes as soon as the job finishes.

I ran the same experiment on an RPi and it took the fully loaded quad core about 60 seconds before frequency scaling set in, so I have no idea what's happening now that I am trying to bring the project to a more complex architecture.

This is on an Intel Core i7 Skylake architecture. Any help understanding this would be greatly appreciated.

Peter Cordes
  • 2
    Consider the thermal mass of a CPU die, relative to the thermal power being dissipated, and consider what the step response of such a system would look like. It's reasonable to see fast changes for a small die with high power flux, and my experience on an i7-8750h is similar to your observations. – nanofarad Jul 19 '20 at 04:46
  • @nanofarad - I see. The thing is that I don't see any frequency scaling occurring at that point. The temperature just levels out. Why doesn't it continue to increase at that rate? Is there some natural equilibrium that is reached? – ericstevens26101 Jul 19 '20 at 04:49
  • 2
    Looks normal. Also note, the type and quality of the CPU cooler will influence both the rate and limits reached. But pegging all cores is roughly the same as flipping the switch on a 125W light bulb -- lights up quickly and turns off quickly too. – David C. Rankin Jul 19 '20 at 05:02
  • 1
    What do you mean you don't see any frequency scaling? It stays at idle clock speed the whole time, like 800MHz? On my i7-6700k (Skylake quad-core desktop), starting a high-power process like video encoding (x264 or x265) will ramp the cores up from ~25C idle (room temp) to ~50 or 60C within a second, then they quickly settle near 70C or so, depending on max all-core turbo of 3.9 or 4.0GHz via `energy_performance_preference`. (Intel since Skylake has hardware power management so it can ramp up from idle clocks in microseconds, not milliseconds. Clock speed decisions are made in hardware.) – Peter Cordes Jul 19 '20 at 05:07
  • @DavidC.Rankin Thank you for your reply. I know I need to incorporate the CPU cooling mechanisms into my model and I am struggling to do that with such a complex architecture. It was **a lot** easier on the Raspberry Pi when my CPU cooler was a stick-on heat sink and me blowing on it. – ericstevens26101 Jul 19 '20 at 05:07
  • @PeterCordes ( sorry Peter, yes, to you) No, I mean that once it reaches its high performance speed it stays there (3.1GHz). There is no frequency scaling to drop the frequency (like DVFS), so I don't see what prevents the temperature from continuing to rise. – ericstevens26101 Jul 19 '20 at 05:11
  • 1
    If you mean throttling (down from max all-core turbo if that's higher than the "guaranteed" sustained frequency), that depends on workload. To make enough heat to make turbo not sustainable, you need to run SIMD FMAs or something similarly high-power, not just a dummy loop. (e.g. Prime95 or video encoding.) Even Intel's stock cooler typically has enough cooling capacity to sustain some turbo with all cores busy on a lot of workloads, staying below sustained TDP. – Peter Cordes Jul 19 '20 at 05:13
  • 1
    That's where the cooling comes into play. It is a heat-transfer problem (2nd order differential). If there was no heat sink, the temp would continue to rise until thermal monitoring shut the CPU down or it burned up. With a cooler, after the initial step input of heat due to activity, the temperature will reach steady-state with the cooler absorbing the amount of heat the CPU is producing. At that point the temp will remain constant until the load changes. – David C. Rankin Jul 19 '20 at 05:14
  • 1
    *I don't see what prevents the temperature from continuing to rise.* - Physics. A higher temperature difference (between chip and heat sink, and between heat sink and air) means more heat transfer per time (aka power). The thermal mass of the chip + heat sink is like a capacitor, the thermal connection from chip to air is like a resistor, and the constant heat power input is like current. So the temperature asymptotically approaches equilibrium, just like in an RC circuit. The equilibrium point (above ambient) depends linearly on total power. (conduction is linear, not T^4 radiative) – Peter Cordes Jul 19 '20 at 05:17
  • @PeterCordes That is incredibly helpful. My real goal is to model core temperature using performance counter events. You seem like an expert. Is this a lost cause on this complex of an architecture? – ericstevens26101 Jul 19 '20 at 05:25
  • 1
    Look into `powertop`; there are things like PMU events for energy/power, I think, but I haven't looked into it. Slides from Intel's IDF2015 presentation about Skylake power management are https://en.wikichip.org/wiki/File:Intel_Architecture,_Code_Name_Skylake_Deep_Dive-_A_New_Architecture_to_Manage_Power_Performance_and_Energy_Efficiency.pdf, unfortunately IDK if the audio of the presentation is still online anywhere. Intel's original site seems to be down, too bad; it was *excellent* and covered a lot of territory on CPU power design in general. – Peter Cordes Jul 19 '20 at 05:29
  • 2
    For chip specific thermal properties (power consumption, throttling, etc..) the hardware data sheet for the chip gives detailed specs, e.g. [8th and 9th Generation Intel® Core™ Processor Families Datasheet, Volume 1 of 2 -- See: Chapter 5 Thermal Management](https://www.intel.com/content/www/us/en/products/docs/processors/core/8th-gen-core-family-datasheet-vol-1.html) – David C. Rankin Jul 19 '20 at 05:31
  • 1
    Also related: http://www.lighterra.com/papers/modernmicroprocessors/, and https://www.realworldtech.com/power-delivery/ for a really low-level look at the CPU design factors. It might be possible to measure temperature vs. time if you do it faster, with much shorter sampling intervals. That might require a more low-level tool than the `lm-sensors` front-end, perhaps reading files in /proc or /sys would be good enough. (I think the cpu temp kernel driver exposes info directly that way.) – Peter Cordes Jul 19 '20 at 05:32
  • @PeterCordes and DavidC.Rankin - You guys are awesome, thank you so much. This project is really about thermal aware process scheduling but I feel it would be inappropriate to ask those questions in this thread. Hope you guys are around when I start digging into that mess. Thanks again! – ericstevens26101 Jul 19 '20 at 05:36

1 Answer

4

This is pretty normal. There isn't much thermal mass in the chip + heat sink, compared to the power that flows through it when it's > 50C above ambient, so it quickly reaches equilibrium.

On my i7-6700k (Skylake quad-core desktop), starting a high-power process like video encoding (x264 or x265) will ramp the cores up from ~25C idle (room temp) to ~50 or 60C within a second, then they quickly settle near 70C or so, depending on max all-core turbo of 3.9 or 4.0GHz via energy_performance_preference. (Intel since Skylake has hardware power management so it can ramp up from idle clocks in microseconds, not milliseconds. Clock speed decisions are made in hardware.)
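A ramp that finishes within about a second is invisible to a script that samples once per second, so seeing the curve means sampling faster. Here is a minimal sketch, assuming the kernel's temperature driver exposes readings as `temp*_input` files in millidegrees Celsius under a hwmon directory (as coretemp normally does on Linux) and that readings are non-negative. The function name `sample_temps` and its arguments are made up for illustration, and the hwmon index in the usage comment is an assumption that varies between machines.

```shell
# sample_temps DIR INTERVAL COUNT
# Print "timestamp sensor degreesC" for every temp*_input file in DIR,
# COUNT times, sleeping INTERVAL seconds between rounds.
sample_temps() {
    st_dir=$1
    st_interval=$2
    st_count=$3
    st_i=0
    while [ "$st_i" -lt "$st_count" ]; do
        for st_f in "$st_dir"/temp*_input; do
            [ -r "$st_f" ] || continue
            read -r st_raw < "$st_f"
            # sysfs reports millidegrees Celsius; split into whole and fractional parts
            printf '%s %s %d.%03d\n' \
                "$(date +%s.%N)" \
                "$(basename "$st_f" _input)" \
                "$((st_raw / 1000))" "$((st_raw % 1000))"
        done
        sleep "$st_interval"
        st_i=$((st_i + 1))
    done
}

# Possible usage (hwmon1 is an assumption; look for the directory whose
# "name" file contains "coretemp"):
#   sample_temps /sys/class/hwmon/hwmon1 0.05 100 > temps.log &
#   stress -c 8 -t 1
#   wait
```

Fractional `sleep` and `date +%s.%N` rely on GNU coreutils. The real sampling period is also somewhat longer than the requested interval, because each round spends time forking `date` and `basename`, so use the logged timestamps, not the nominal interval, when fitting a curve.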

> I mean that once it reaches its high performance speed it stays there (3.1GHz). There is no frequency scaling to drop the frequency (like DVFS)

If you mean throttling (down from max all-core turbo if that's higher than the rated / "guaranteed" sustained frequency), that depends on workload. To make enough heat to make turbo not sustainable, you need to run SIMD FMAs or something similarly high-power, not just a dummy loop. (e.g. Prime95 or video encoding.)

Even Intel's stock cooler typically has enough cooling capacity to sustain some turbo with all cores busy on a lot of workloads, staying below sustained TDP. Or maybe your CPU's max all-core turbo isn't any higher than its rated speed. i7-6700k isn't: 4.0GHz for both. Only 1 or 2 core turbo is 4.2GHz. (And that's not really limited by overall thermals, more just how fast the transistors are and / or not creating a hot-spot on the one core that's active.)
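To check whether the clocks really stay put under load, you can log the per-core frequency next to the temperature. This is a rough sketch, assuming the cpufreq driver exposes `scaling_cur_freq` (in kHz) for each core under `/sys/devices/system/cpu`; the function name `print_freqs` is made up for illustration.

```shell
# print_freqs [ROOT]
# Print the current frequency of each CPU in MHz, reading the cpufreq
# sysfs files under ROOT (default: /sys/devices/system/cpu).
print_freqs() {
    pf_root=${1:-/sys/devices/system/cpu}
    for pf_f in "$pf_root"/cpu[0-9]*/cpufreq/scaling_cur_freq; do
        [ -r "$pf_f" ] || continue
        read -r pf_khz < "$pf_f"
        # Strip the root prefix and everything after the cpuN component
        pf_cpu=${pf_f#"$pf_root"/}
        pf_cpu=${pf_cpu%%/*}
        printf '%s %d MHz\n' "$pf_cpu" "$((pf_khz / 1000))"
    done
}

# Possible usage while a load is running:
#   while sleep 0.2; do print_freqs; done
```

If the reported frequency holds steady while the temperature levels off, the plateau comes from the cooler reaching equilibrium, not from throttling.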

Of course the "k" models are overclockable so the stock turbo settings are conservative, but I like to keep my fans quiet, not have a burst of fan spin-up sound when a clunky web-page loads.

My cooler is a CoolerMaster Gemini II, big clunky thing with heat pipes and a big fan that (at room temp) barely turns, so mine has more thermal mass than a stock cooler. And the rear case fan literally stops when CPU / mobo temps are below ~40C, as I configured it in the BIOS.

> I don't see what prevents the temperature from continuing to rise.

Physics. A higher temperature difference (between chip and heat sink, and between heat sink and air) means more heat transfer per time (aka power). The thermal mass of the chip + heat sink is like a capacitor, the thermal connection from chip to air is like a resistor, and the constant heat power input is like current.

So the temperature asymptotically approaches equilibrium, just like in an RC circuit. The equilibrium point (above ambient) depends linearly on total power.

(Heat conduction (and fan-forced convection) scales linearly with temperature difference, just like electrical conductance / resistance. It's the dominant factor here, not radiative transfer that scales with absolute T^4)
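The RC analogy can be written out as a first-order lumped model. This is only a sketch: the symbols are my own notation, with $C$ the thermal capacitance of chip plus heat sink (J/K), $R$ the thermal resistance from chip to air (K/W), $P$ the constant heat power (W), and $\Delta T = T - T_\text{ambient}$. A real cooler has more than one time constant, so treat it as a first approximation.

```latex
% Energy balance: heat stored = heat in - heat conducted away
C \,\frac{d\,\Delta T}{dt} = P - \frac{\Delta T}{R}

% Step response when the load starts at t = 0 with \Delta T(0) = 0
\Delta T(t) = P R \left(1 - e^{-t/\tau}\right), \qquad \tau = R C

% Equilibrium rise above ambient is linear in power
\Delta T_\infty = P R

% Cooling after the load stops (P = 0), starting from \Delta T_0
\Delta T(t) = \Delta T_0 \, e^{-t/\tau}
```

With purely hypothetical values of $R = 0.5$ K/W and $P = 90$ W, the equilibrium rise would be $\Delta T_\infty = 45$ K above ambient. A small $C$ gives a small $\tau$, which is why both the jump when the load starts and the drop when it stops look almost instantaneous at one-second resolution.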

Dynamic fan speed, which ramps up with CPU temperature, also helps limit the rise.

BTW, I think the heatpipes on my cooler explain the very quick ramp-up to ~60C, and then gradual ramp-up the rest of the way: the CPU itself can get hot very fast, and starts transferring heat into the heatpipes (which go into the base of the cooler, so there's just some thermal paste and copper). A heat pipe can absorb heat directly by vaporizing its working fluid. But with sustained heat input, the heat has to go somewhere: into the mass of fins, and from there to the air. So the gradual asymptotic increase may be as the fins themselves heat up, having to dissipate heat into the air, not just conduct it out of the heat-pipe.


There are systems built without enough sustained cooling to handle sustained max-turbo. For x86 systems, you'll find those in laptops, especially light-weight and especially ultra-portable laptops with Core-Y CPUs (TDP of like 7.5W, but still full Skylake cores with AVX2 that can turbo pretty high).

The question "Why can't my ultraportable laptop CPU maintain peak performance in HPC" has some data showing clock speed falling off, and my answer there explains why they build systems this way: burst performance is what you want for interactive use, and the combo of light weight (fans / heat sinks) + high burst inevitably means they can't sustain their max turbo.

But desktops can be heavy, and people do want machines that can crunch numbers for a long time at clock speeds as high as possible.

Peter Cordes