35

On recent CPUs (at least the last decade or so) Intel has offered three fixed-function hardware performance counters, in addition to various configurable performance counters. The three fixed counters are:

INST_RETIRED.ANY
CPU_CLK_UNHALTED.THREAD
CPU_CLK_UNHALTED.REF_TSC

The first counts retired instructions, the second counts actual (unhalted) core cycles, and the last is the one that interests us here. The description in Volume 3 of the Intel Software Developer's Manual is:

This event counts the number of reference cycles at the TSC rate when the core is not in a halt state and not in a TM stop-clock state. The core enters the halt state when it is running the HLT instruction or the MWAIT instruction. This event is not affected by core frequency changes (e.g., P states) but counts at the same frequency as the time stamp counter. This event can approximate elapsed time while the core was not in a halt state and not in a TM stopclock state.
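
As an aside, once a driver (such as libpfc's pfc.ko, used below) has enabled these counters and user-mode rdpmc is permitted (CR4.PCE set), they can be read directly: bit 30 of the rdpmc selector picks the fixed-function counter set. A minimal sketch, where read_fixed_ctr() is just an illustrative helper, not part of the demo:

#include <stdint.h>
#include <x86intrin.h>  // __rdpmc (GCC/Clang)

// Read fixed-function counter i: 0 = INST_RETIRED.ANY,
// 1 = CPU_CLK_UNHALTED.THREAD, 2 = CPU_CLK_UNHALTED.REF_TSC.
// Assumes the counter is already enabled in IA32_FIXED_CTR_CTRL
// and that user-mode RDPMC is allowed (CR4.PCE = 1).
static inline uint64_t read_fixed_ctr(int i) {
    return __rdpmc((1 << 30) | i);
}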

So for a CPU-bound loop, I expect this value to match the free-running TSC value read from rdtsc, since the two should diverge only over halted cycles (i.e., hlt or mwait) or whatever the "TM stop-clock state" is.

I test this with the following loop (the entire standalone demo is available on GitHub):

for (int i = 0; i < 100; i++) {
    PFC_CNT cnt[7] = {};

    int64_t start = nanos();
    PFCSTART(cnt);
    int64_t tsc = __rdtsc();
    busy_loop(CALIBRATION_LOOPS);
    PFCEND(cnt);
    int64_t tsc_delta   = __rdtsc() - tsc;
    int64_t nanos_delta = nanos() - start;

    printf(CPU_W "d" REF_W ".2f" TSC_W ".2f" MHZ_W ".2f" RAT_W ".6f\n",
            sched_getcpu(),
            1000.0 * cnt[PFC_FIXEDCNT_CPU_CLK_REF_TSC] / nanos_delta,
            1000.0 * tsc_delta / nanos_delta,
            1000.0 * CALIBRATION_LOOPS / nanos_delta,
            1.0 * cnt[PFC_FIXEDCNT_CPU_CLK_REF_TSC]/tsc_delta);
}

The only important work in the timed region is busy_loop(CALIBRATION_LOOPS), a tight loop of volatile stores that, as compiled by gcc and clang, executes at one cycle per iteration on recent hardware:

void busy_loop(uint64_t iters) {
    volatile int sink;
    do {
        sink = 0;          // volatile store: can't be optimized away
    } while (--iters > 0); // dec + branch: one iteration per cycle
    (void)sink;            // suppress unused-variable warning
}

The PFCSTART and PFCEND macros read the CPU_CLK_UNHALTED.REF_TSC counter using libpfc. __rdtsc() is an intrinsic that reads the TSC via the rdtsc instruction. Finally, we measure real time with nanos(), which is simply:

int64_t nanos() {
    auto t = std::chrono::high_resolution_clock::now();
    return std::chrono::time_point_cast<std::chrono::nanoseconds>(t).time_since_epoch().count();
}
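
One caveat worth noting: with libstdc++, high_resolution_clock is an alias for system_clock, so this wall-clock time can be slewed by NTP (relevant to footnote ¹ below). If that mattered at this scale, a monotonic variant would look like the sketch below (nanos_raw() is just an illustrative name):

#include <stdint.h>
#include <time.h>

// Monotonic nanoseconds via clock_gettime (Linux-specific);
// CLOCK_MONOTONIC_RAW is additionally immune to NTP rate slewing.
int64_t nanos_raw() {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
    return (int64_t)ts.tv_sec * 1000000000 + ts.tv_nsec;
}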

Yes, I don't issue a cpuid, and things aren't interleaved in an exact way, but the calibration loop runs for a full second, so such nanosecond-scale issues just get diluted down to more or less nothing.
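
For reference, fully serialized timestamp reads would look something like the following sketch, in the spirit of Intel's code-benchmarking guidance; over a one-second region the difference is lost in the noise. tsc_begin()/tsc_end() are illustrative helpers, not part of the demo:

#include <stdint.h>
#include <cpuid.h>      // __get_cpuid (GCC/Clang)
#include <x86intrin.h>  // __rdtsc, __rdtscp

static inline uint64_t tsc_begin() {
    unsigned a, b, c, d;
    __get_cpuid(0, &a, &b, &c, &d);  // cpuid serializes: nothing earlier
    return __rdtsc();                // or later overlaps the read
}

static inline uint64_t tsc_end() {
    unsigned aux, a, b, c, d;
    uint64_t t = __rdtscp(&aux);     // waits for prior instructions to finish
    __get_cpuid(0, &a, &b, &c, &d);  // fences against later instructions
    return t;
}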

With TurboBoost enabled, here are the first few results from a typical run on my i7-6700HQ Skylake CPU:

CPU# REF_TSC   rdtsc Eff Mhz     Ratio
   0 2392.05 2591.76 2981.30  0.922946
   0 2381.74 2591.79 3032.86  0.918955
   0 2399.12 2591.79 3032.50  0.925660
   0 2385.04 2591.79 3010.58  0.920230
   0 2378.39 2591.79 3010.21  0.917663
   0 2355.84 2591.77 2928.96  0.908970
   0 2364.99 2591.79 2942.32  0.912492
   0 2339.64 2591.77 2935.36  0.902720
   0 2366.43 2591.79 3022.08  0.913049
   0 2401.93 2591.79 3023.52  0.926747
   0 2452.87 2591.78 3070.91  0.946400
   0 2350.06 2591.79 2961.93  0.906733
   0 2340.44 2591.79 2897.58  0.903020
   0 2403.22 2591.79 2944.77  0.927246
   0 2394.10 2591.79 3059.58  0.923723
   0 2359.69 2591.78 2957.79  0.910449
   0 2353.33 2591.79 2916.39  0.907992
   0 2339.58 2591.79 2951.62  0.902690
   0 2395.82 2591.79 3017.59  0.924389
   0 2353.47 2591.79 2937.82  0.908047

Here, REF_TSC is the fixed TSC performance counter described above, and rdtsc is the result from the rdtsc instruction. Eff Mhz is the effective true CPU frequency calculated over the interval, shown mostly for curiosity's sake and as a quick confirmation of how much turbo is kicking in. (Each frequency column is just a count divided by nanos_delta, times 1000: events per nanosecond are GHz, so the factor of 1000 yields MHz.) Ratio is the ratio of the REF_TSC and rdtsc columns. I would expect it to be very close to 1, but in practice it hovers around 0.90 to 0.92, with a lot of variance (I've seen it as low as 0.8 on other runs).

Graphically it looks something like this²:

PMU tsc vs rdtsc

The rdtsc call returns nearly exact results¹, while the PMU TSC counter is all over the place, sometimes dipping almost as low as 2300 MHz.

If I turn off turbo, however, the results are much more consistent:

CPU# REF_TSC   rdtsc Eff Mhz     Ratio
   0 2592.26 2592.25 2588.30  1.000000
   0 2592.26 2592.26 2591.11  1.000000
   0 2592.26 2592.26 2590.40  1.000000
   0 2592.25 2592.25 2590.43  1.000000
   0 2592.26 2592.26 2590.75  1.000000
   0 2592.26 2592.26 2590.05  1.000000
   0 2592.25 2592.25 2590.04  1.000000
   0 2592.24 2592.24 2590.86  1.000000
   0 2592.25 2592.25 2590.35  1.000000
   0 2592.25 2592.25 2591.32  1.000000
   0 2592.25 2592.25 2590.63  1.000000
   0 2592.25 2592.25 2590.87  1.000000
   0 2592.25 2592.25 2590.77  1.000000
   0 2592.25 2592.25 2590.64  1.000000
   0 2592.24 2592.24 2590.30  1.000000
   0 2592.23 2592.23 2589.64  1.000000
   0 2592.23 2592.23 2590.83  1.000000
   0 2592.23 2592.23 2590.49  1.000000
   0 2592.23 2592.23 2590.78  1.000000
   0 2592.23 2592.23 2590.84  1.000000
   0 2592.22 2592.22 2588.80  1.000000

Basically, the ratio is 1.000000 to 6 decimal places.

Graphically (with the Y axis scale forced to be the same as the previous graph):

PMU vs rdtsc (no turbo)

Now the code is just running a hot loop, and there should be no hlt or mwait instructions, certainly nothing that would imply a variation of more than 10%. I can't say for sure what "TM stop-clock cycles" are, but I'd bet they are "thermal management stop-clock cycles", a trick used to temporarily throttle the CPU when it reaches its maximum temperature. However, I looked at the integrated thermistor readings, and I never saw the CPU break 60C, far below the 90C-100C where thermal management kicks in (I think).

Any idea what this could be? Are there implied "halt cycles" needed to transition between different turbo frequencies? Such transitions definitely happen here, since the box is not quiet and the turbo frequency jumps up and down as other cores start and stop working on background stuff (the max turbo frequency depends directly on the number of active cores: on my box it is 3.5, 3.3, 3.2 and 3.1 GHz for 1, 2, 3 or 4 active cores, respectively).


¹ In fact, for a while I really was getting exact results to two decimal places, 2591.97 MHz, iteration after iteration. Then something changed, I'm not exactly sure what, and now there is a small variation of about 0.1% in the rdtsc results. One possibility is a gradual clock adjustment made by the Linux timing subsystem to bring the local crystal-derived time in line with the ntpd-determined time. Or perhaps it is simply crystal drift: the last graph above shows a steady increase in the measured period of rdtsc each second.

² The graphs don't correspond to the same runs as the values shown in the text, because I'm not going to update the graphs every time I change the text output format. The qualitative behavior is essentially the same on every run, however.

BeeOnRope
  • Comments are not for extended discussion; this conversation has been [moved to chat](http://chat.stackoverflow.com/rooms/151368/discussion-on-question-by-beeonrope-turboboost-oddity-an-inconsistency-between). – Bhargav Rao Aug 08 '17 at 10:54
  • 1
    Modern OSes sleep with `mwait`, rather than `hlt`. [Different register values for `mwait` put the CPU into different C-states](https://stackoverflow.com/a/44996041/224132). But yeah, same difference: OS-initiated sleeps shouldn't happen while a thread is ready to run. – Peter Cordes Aug 08 '17 at 23:43
  • 1
    Hypothesis: the clock halts *while the CPU is changing frequency / voltage*, until it stabilizes at the new frequency. – Peter Cordes Aug 09 '17 at 00:28
  • Indeed, that's consistent with what I've found. For example, if I run `stress --cpu 4` in the background of the test on my 4-core box, the vast majority of the variance goes away. The idea is that in this case you don't have any turbo ratio transitions, since there are always 4 active cores. @PeterCordes – BeeOnRope Aug 09 '17 at 00:32
  • SKL thermal limits were kicking in for me around 80 to 85C when OCing an i7-6700k with stock BIOS settings, IIRC (clock speed drops to 3.9GHz until it cools; maybe there's a further level of throttling where it stops the clock at 90 or 100C). Mysticial was saying that his SKL-X i9 desktop has configurable thermal-throttling limits that can be raised above their defaults. But anyway, yes, 60C should be far below any throttling limits on any motherboard. – Peter Cordes Aug 09 '17 at 01:39
  • @peter also see my comment moved to chat: when running all cores active, the issue almost disappears, while if it was thermals related you'd expect the opposite. – BeeOnRope Aug 09 '17 at 01:55
  • 4
    @PeterCordes Yeah, that's what I found as well. Regarding throttling, I also unearthed an awesome `MSR_CORE_PERF_LIMITS_REASONS` that does an excellent job of showing what's currently throttling. Currently my CPU package reports throttling on _Power Limiter 2_ and _Max Turbo Limit_, but occasionally also _Electrical Design Point_ and _Turbo Transition Attenuation_. The mere existence of the last one shows that the Intel people want to avoid excessive TurboBoost state transitions by adding hysteresis of some kind. This may or may not be configurable. – Iwillnotexist Idonotexist Aug 09 '17 at 01:55
  • @IwillnotexistIdonotexist Do you mind if I ask where you found the `MSR_CORE_PERF_LIMITS_REASONS` register? I can't seem to find it in the Intel manuals :/ Thank you – Margaret Bloom Aug 09 '17 at 12:44
  • @MargaretBloom Am close to finishing my answer for Bee; but it's in recent Intel SDMs. The name is not greppable by Ctrl+F; there might be a hidden nbsp in Intel's PDFs. Look at Table 2-29 for MSR 690H. – Iwillnotexist Idonotexist Aug 09 '17 at 12:47
  • As a matter of fact, I just realized Intel's doc is wrong. It must be wrong if it simultaneously claims that `REF_TSC` is _"not affected by core frequency changes (e.g. P-States)"_ and that there could be _"performance degradation due to frequent operating ratio changes"_. I happen to believe the former is a lot less likely to be true than the latter, and the evidence bears it out. – Iwillnotexist Idonotexist Aug 09 '17 at 14:45
  • To be fair, what Intel is saying there is that the REF TSC counts at a fixed frequency regardless of the p-state, as opposed to the other CLK counters, which count actual cycles and hence vary with p-state. It's an important clarification on Intel's part because the first several generations of chips with p-state frequency scaling had the opposite behavior wrt the TSC. I don't think they mean to say that the p-state will have zero effect on the counter when considering transition halt states. – BeeOnRope Aug 09 '17 at 15:23
  • This data is for your Skylake CPU, right? I wanted to cite this on [How much delay is generated by this assembly code in linux](//stackoverflow.com/q/49924102) as a reason that delay loops are ridiculous. – Peter Cordes Apr 20 '18 at 01:56
  • Yes, those numbers are from my Skylake i7-6700HQ. @PeterCordes – BeeOnRope Apr 20 '18 at 02:16

1 Answer

32

TL;DR

The discrepancy you are observing between RDTSC and REFTSC is due to TurboBoost P-state transitions. During these transitions, most of the core, including the fixed-function performance counter REF_TSC, is halted for approximately 20000-21000 cycles (8.5us), while rdtsc continues ticking at its invariant frequency. rdtsc is probably in an isolated power and clock domain because it is so important and because of its documented wallclock-like behaviour.

The RDTSC-REFTSC Discrepancy

The discrepancy manifests itself as a tendency for RDTSC to overcount REFTSC. The longer the program runs, the more positive the difference RDTSC-REFTSC tends to be. Over very long stretches it can amount to 1%-2% or even more.

Of course, as you have already observed yourself, the overcounting disappears when TurboBoost is disabled, which can be done as follows when using intel_pstate:

echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo

But that does not tell us for sure that TurboBoost itself is at fault for the discrepancy; it could be that the higher P-states enabled by TurboBoost eat up the available headroom, causing thermal throttling and halts.

Possible Throttling?

TurboBoost is a dynamic frequency and voltage scaling solution to opportunistically take advantage of headroom in the operating envelope (thermal or electrical). When possible, TurboBoost will then scale up the core frequency and voltage of the processor beyond their nominal value, thus improving performance at the expense of higher power consumption.

The higher power consumption of course raises the core temperature. Eventually, some sort of limit will be hit, and TurboBoost will have to crank performance back down.

TM1 Thermal Throttling?

I began by investigating whether the Thermal Control Circuitry (TCC) for Thermal Monitor 1 (TM1) or 2 (TM2) was causing thermal throttling. TM1 reduces power consumption by inserting TM stop-clock cycles, and these are one of the conditions documented to halt REFTSC. TM2, on the other hand, does not gate the clock; it only scales the frequency.

I modified libpfc to let me read select MSRs, specifically the IA32_PACKAGE_THERM_STATUS and IA32_THERM_STATUS MSRs. Both contain a read-only Status flag and a read-write, hardware-sticky Log flag for various thermal conditions:

IA32_THERM_STATUS register contents (The IA32_PACKAGE_THERM_STATUS register is substantially the same)
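
For illustration: without a custom driver, the same MSRs can be read from userspace through Linux's stock msr module (modprobe msr, root required), where the register address is the pread offset. The sketch below is not the libpfc code; rdmsr() is just an illustrative helper, and 0x19C / 0x1B1 are the SDM addresses of IA32_THERM_STATUS / IA32_PACKAGE_THERM_STATUS.

#include <stdint.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

// Read one MSR on one CPU via /dev/cpu/<n>/msr (the msr kernel module).
static int rdmsr(int cpu, uint32_t msr, uint64_t *val) {
    char path[64];
    snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    int ok = pread(fd, val, 8, msr) == 8 ? 0 : -1;
    close(fd);
    return ok;
}

// Usage: uint64_t v; rdmsr(0, 0x19C, &v);  // IA32_THERM_STATUS
//        uint64_t p; rdmsr(0, 0x1B1, &p);  // IA32_PACKAGE_THERM_STATUS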

While some of these bits were on occasion set (especially when blocking laptop air vents!), they did not seem to correlate with RDTSC overcounting, which would reliably occur regardless of thermal status.

Hardware Duty Cycling? C-State Residency?

Digging elsewhere in the SDM for stop-clock-like hardware, I happened upon HDC (Hardware Duty Cycling), a mechanism by which the OS can request that the CPU operate only a fixed proportion of the time; the HDC hardware implements this by running the processor for 1-15 clock cycles of every 16-clock period, and force-idling it for the remaining 15-1 cycles.

HDC offers very useful registers, in particular the MSRs:

  • IA32_THREAD_STALL: Counts the number of cycles stalled due to forced idling on this logical processor.
  • MSR_CORE_HDC_RESIDENCY: Same as above but for the physical processor, counts cycles when one or more logical processors of this core are force-idling.
  • MSR_PKG_HDC_SHALLOW_RESIDENCY: Counts cycles that the package was in C2 state and at least one logical processor was force-idling.
  • MSR_PKG_HDC_DEEP_RESIDENCY: Counts cycles that the package was in a deeper C-state (precisely which one is configurable) and at least one logical processor was force-idling.

For further details refer to the Intel SDM Volume 3, Chapter 14, §14.5.1 Hardware Duty Cycling Programming Interface.

But my i7-4700MQ 2.4 GHz CPU doesn't support HDC, and so that was that for HDC.

Other Sources of Throttling?

Digging still more in the Intel SDM, I found a very, very juicy MSR: MSR_CORE_PERF_LIMIT_REASONS. This register reports a large number of very useful Status and sticky Log bits:

690H MSR_CORE_PERF_LIMIT_REASONS - Package - Indicator of Frequency Clipping in Processor Cores

  • Bit 0: PROCHOT Status
  • Bit 1: Thermal Status
  • Bit 4: Graphics Driver Status. When set, frequency is reduced below the operating system request due to Processor Graphics driver override.
  • Bit 5: Autonomous Utilization-Based Frequency Control Status. When set, frequency is reduced below the operating system request because the processor has detected that utilization is low.
  • Bit 6: Voltage Regulator Thermal Alert Status. When set, frequency is reduced below the operating system request due to a thermal alert from the Voltage Regulator.
  • Bit 8: Electrical Design Point Status. When set, frequency is reduced below the operating system request due to electrical design point constraints (e.g. maximum electrical current consumption).
  • Bit 9: Core Power Limiting Status. When set, frequency is reduced below the operating system request due to domain-level power limiting.
  • Bit 10: Package-Level Power Limiting PL1 Status. When set, frequency is reduced below the operating system request due to package-level power limiting PL1.
  • Bit 11: Package-Level Power Limiting PL2 Status. When set, frequency is reduced below the operating system request due to package-level power limiting PL2.
  • Bit 12: Max Turbo Limit Status. When set, frequency is reduced below the operating system request due to multi-core turbo limits.
  • Bit 13: Turbo Transition Attenuation Status. When set, frequency is reduced below the operating system request due to Turbo transition attenuation. This prevents performance degradation due to frequent operating ratio changes.
  • Bit 16: PROCHOT Log
  • Bit 17: Thermal Log
  • Bit 20: Graphics Driver Log
  • Bit 21: Autonomous Utilization-Based Frequency Control Log
  • Bit 22: Voltage Regulator Thermal Alert Log
  • Bit 24: Electrical Design Point Log
  • Bit 25: Core Power Limiting Log
  • Bit 26: Package-Level Power Limiting PL1 Log
  • Bit 27: Package-Level Power Limiting PL2 Log
  • Bit 28: Max Turbo Limit Log
  • Bit 29: Turbo Transition Attenuation Log

pfc.ko now supports this MSR, and a demo prints which of these log bits is active. The pfc.ko driver clears the sticky bits on every read.
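
The decoding itself is straightforward. A sketch against the bit layout above (illustrative only, not pfc.ko's actual code): each Status bit at position n has its sticky Log twin at position n+16. The raw value could come from the rdmsr() helper sketched earlier.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

// MSR_CORE_PERF_LIMIT_REASONS (690H): Status bits in the low word,
// matching sticky Log bits 16 positions higher.
static const struct { int bit; const char *name; } reasons[] = {
    { 0, "PROCHOT"},
    { 1, "Thermal"},
    { 4, "Graphics Driver"},
    { 5, "Autonomous Utilization-Based Frequency Control"},
    { 6, "Voltage Regulator Thermal Alert"},
    { 8, "Electrical Design Point"},
    { 9, "Core Power Limiting"},
    {10, "Package-Level PL1 Power Limiting"},
    {11, "Package-Level PL2 Power Limiting"},
    {12, "Max Turbo Limit"},
    {13, "Turbo Transition Attenuation"},
};

static void print_limit_reasons(uint64_t v) {
    for (size_t i = 0; i < sizeof(reasons) / sizeof(reasons[0]); i++)
        printf("%c %s%s\n",
               (v >> (reasons[i].bit + 16)) & 1 ? '*' : ' ',  // Log (sticky)
               reasons[i].name,
               (v >> reasons[i].bit) & 1 ? " (status set)" : "");
}

For the value 0000000018001000 seen in the logs below, this stars Package-Level PL2 Power Limiting and Max Turbo Limit (Log bits 27 and 28), matching the demo's output.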

I reran your experiments while printing these bits; under very heavy load (all 4 cores / 8 threads active), my CPU reports several limiting factors, including Electrical Design Point and Core Power Limiting. The Package-Level PL2 and Max Turbo Limit bits are always set on my CPU, for reasons unknown to me. I also saw Turbo Transition Attenuation on occasion.

While none of these bits exactly correlated with the presence of the RDTSC-REFTSC discrepancy, the last one gave me food for thought. The mere existence of Turbo Transition Attenuation implies that switching P-states has a substantial enough cost that it must be rate-limited with some hysteresis mechanism. When I could not find an MSR that counts these transitions, I decided to do the next best thing: use the magnitude of the RDTSC-REFTSC overcount to characterize the performance implications of a TurboBoost transition.
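
Concretely, the statistic is just the difference of the two deltas from the question's measurement loop (the variable names below follow that loop):

// How far RDTSC ran ahead of REF_TSC over the same interval. Both tick
// at the same nominal rate, so any sizeable positive difference is time
// during which REF_TSC was halted while the TSC kept running.
int64_t overcount = tsc_delta - (int64_t)cnt[PFC_FIXEDCNT_CPU_CLK_REF_TSC];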

Experiment

The experimental setup is as follows. On my i7-4700MQ CPU, nominal speed 2.4 GHz and max turbo speed 3.4 GHz, I offline all cores except 0 (the boot processor) and 3 (a convenient victim core that is neither core 0 nor a logical sibling of it). I then ask the intel_pstate driver for a package performance of no less than 98% and no more than 100% of maximum; this constrains the processor to oscillate between the second-highest and highest P-states (3.3 GHz and 3.4 GHz). I do this as follows:

echo   0 > /sys/devices/system/cpu/cpu1/online
echo   0 > /sys/devices/system/cpu/cpu2/online
echo   0 > /sys/devices/system/cpu/cpu4/online
echo   0 > /sys/devices/system/cpu/cpu5/online
echo   0 > /sys/devices/system/cpu/cpu6/online
echo   0 > /sys/devices/system/cpu/cpu7/online
echo  98 > /sys/devices/system/cpu/intel_pstate/min_perf_pct
echo 100 > /sys/devices/system/cpu/intel_pstate/max_perf_pct

I ran the demo application for 10000 samples at

1000,     1500,     2500,     4000,     6300,
10000,    15000,    25000,    40000,    63000,
100000,   150000,   250000,   400000,   630000,
1000000,  1500000,  2500000,  4000000,  6300000,
10000000, 15000000, 25000000, 40000000, 63000000

nanoseconds per add_calibration() call, executed at nominal CPU frequency (multiply the numbers above by 2.4 to get the actual argument to add_calibration(); the 250000 ns point, for instance, corresponds to add_calibration(600000)).

Results

This produces logs that look like the following (the 250000-nanosecond case):

CPU 0, measured CLK_REF_TSC MHz        :          2392.56
CPU 0, measured rdtsc MHz              :          2392.46
CPU 0, measured add   MHz              :          3286.30
CPU 0, measured XREF_CLK  time (s)     :       0.00018200
CPU 0, measured delta     time (s)     :       0.00018258
CPU 0, measured tsc_delta time (s)     :       0.00018200
CPU 0, ratio ref_tsc :ref_xclk         :      24.00131868
CPU 0, ratio ref_core:ref_xclk         :      33.00071429
CPU 0, ratio rdtsc   :ref_xclk         :      24.00032967
CPU 0, core CLK cycles in OS           :                0
CPU 0, User-OS transitions             :                0
CPU 0, rdtsc-reftsc overcount          :              -18
CPU 0, MSR_IA32_PACKAGE_THERM_STATUS   : 000000008819080a
CPU 0, MSR_IA32_PACKAGE_THERM_INTERRUPT: 0000000000000003
CPU 0, MSR_CORE_PERF_LIMIT_REASONS     : 0000000018001000
        PROCHOT
        Thermal
        Graphics Driver
        Autonomous Utilization-Based Frequency Control
        Voltage Regulator Thermal Alert
        Electrical Design Point (e.g. Current)
        Core Power Limiting
        Package-Level PL1 Power Limiting
      * Package-Level PL2 Power Limiting
      * Max Turbo Limit (Multi-Core Turbo)
        Turbo Transition Attenuation
CPU 0, measured CLK_REF_TSC MHz        :          2392.63
CPU 0, measured rdtsc MHz              :          2392.62
CPU 0, measured add   MHz              :          3288.03
CPU 0, measured XREF_CLK  time (s)     :       0.00018192
CPU 0, measured delta     time (s)     :       0.00018248
CPU 0, measured tsc_delta time (s)     :       0.00018192
CPU 0, ratio ref_tsc :ref_xclk         :      24.00000000
CPU 0, ratio ref_core:ref_xclk         :      32.99983509
CPU 0, ratio rdtsc   :ref_xclk         :      23.99989006
CPU 0, core CLK cycles in OS           :                0
CPU 0, User-OS transitions             :                0
CPU 0, rdtsc-reftsc overcount          :               -2
CPU 0, MSR_IA32_PACKAGE_THERM_STATUS   : 000000008819080a
CPU 0, MSR_IA32_PACKAGE_THERM_INTERRUPT: 0000000000000003
CPU 0, MSR_CORE_PERF_LIMIT_REASONS     : 0000000018001000
        PROCHOT
        Thermal
        Graphics Driver
        Autonomous Utilization-Based Frequency Control
        Voltage Regulator Thermal Alert
        Electrical Design Point (e.g. Current)
        Core Power Limiting
        Package-Level PL1 Power Limiting
      * Package-Level PL2 Power Limiting
      * Max Turbo Limit (Multi-Core Turbo)
        Turbo Transition Attenuation
CPU 0, measured CLK_REF_TSC MHz        :          2284.69
CPU 0, measured rdtsc MHz              :          2392.63
CPU 0, measured add   MHz              :          3151.99
CPU 0, measured XREF_CLK  time (s)     :       0.00018121
CPU 0, measured delta     time (s)     :       0.00019036
CPU 0, measured tsc_delta time (s)     :       0.00018977
CPU 0, ratio ref_tsc :ref_xclk         :      24.00000000
CPU 0, ratio ref_core:ref_xclk         :      33.38540919
CPU 0, ratio rdtsc   :ref_xclk         :      25.13393301
CPU 0, core CLK cycles in OS           :                0
CPU 0, User-OS transitions             :                0
CPU 0, rdtsc-reftsc overcount          :            20548
CPU 0, MSR_IA32_PACKAGE_THERM_STATUS   : 000000008819080a
CPU 0, MSR_IA32_PACKAGE_THERM_INTERRUPT: 0000000000000003
CPU 0, MSR_CORE_PERF_LIMIT_REASONS     : 0000000018000000
        PROCHOT
        Thermal
        Graphics Driver
        Autonomous Utilization-Based Frequency Control
        Voltage Regulator Thermal Alert
        Electrical Design Point (e.g. Current)
        Core Power Limiting
        Package-Level PL1 Power Limiting
      * Package-Level PL2 Power Limiting
      * Max Turbo Limit (Multi-Core Turbo)
        Turbo Transition Attenuation
CPU 0, measured CLK_REF_TSC MHz        :          2392.46
CPU 0, measured rdtsc MHz              :          2392.45
CPU 0, measured add   MHz              :          3287.80
CPU 0, measured XREF_CLK  time (s)     :       0.00018192
CPU 0, measured delta     time (s)     :       0.00018249
CPU 0, measured tsc_delta time (s)     :       0.00018192
CPU 0, ratio ref_tsc :ref_xclk         :      24.00000000
CPU 0, ratio ref_core:ref_xclk         :      32.99978012
CPU 0, ratio rdtsc   :ref_xclk         :      23.99989006
CPU 0, core CLK cycles in OS           :                0
CPU 0, User-OS transitions             :                0
CPU 0, rdtsc-reftsc overcount          :               -2
CPU 0, MSR_IA32_PACKAGE_THERM_STATUS   : 000000008819080a
CPU 0, MSR_IA32_PACKAGE_THERM_INTERRUPT: 0000000000000003
CPU 0, MSR_CORE_PERF_LIMIT_REASONS     : 0000000018001000
        PROCHOT
        Thermal
        Graphics Driver
        Autonomous Utilization-Based Frequency Control
        Voltage Regulator Thermal Alert
        Electrical Design Point (e.g. Current)
        Core Power Limiting
        Package-Level PL1 Power Limiting
      * Package-Level PL2 Power Limiting
      * Max Turbo Limit (Multi-Core Turbo)
        Turbo Transition Attenuation

I made several observations about the logs, but one stood out:

For nanos < ~250000, there is negligible RDTSC overcounting. For nanos > ~250000, one reliably observes overcounting in quanta of just over 20000 clock cycles. But these quanta are not due to user-OS transitions.

Here is a visual plot:

Image showing quantized TurboBoost transition penalties

Saturated blue dots: 0 standard deviations (close to mean)
Saturated red dots: +3 standard deviations (above mean)
Saturated green dots: -3 standard deviations (below mean)

There is a marked difference before, during and after roughly 250000 nanoseconds of sustained decrementing.

Nanos < 250000

Before the threshold, the CSV logs (columns: ref_tsc:ref_xclk ratio, ref_core:ref_xclk ratio, rdtsc:ref_xclk ratio, rdtsc-reftsc overcount, core CLK cycles in OS, user-OS transitions) look like this:

24.00,33.00,24.00,-14,0,0
24.00,33.00,24.00,-20,0,0
24.00,33.00,24.00,-4,3639,1
24.00,33.00,24.00,-20,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,-14,0,0
24.00,33.00,24.00,-14,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,-44,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,-14,0,0
24.00,33.00,24.00,-20,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,-20,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,12,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,-20,0,0
24.00,33.00,24.00,32,3171,1
24.00,33.00,24.00,-20,0,0
24.00,33.00,24.00,10,0,0

These figures indicate a TurboBoost ratio perfectly stable at 33x, an RDTSC counting in synchrony with REFTSC at 24x the rate of REF_XCLK (the 100 MHz reference clock, hence 2.4 GHz), negligible overcounting, and typically 0 cycles spent in the kernel and thus 0 transitions into the kernel. Kernel interrupts take approximately 3000 reference cycles to service.

Nanos == 250000

At the critical threshold, the log contains clumps of 20000-cycle overcounts, and these overcounts correlate very well with non-integer estimated multiplier values between 33x and 34x:

24.00,33.00,24.00,-2,0,0
24.00,33.00,24.00,-2,0,0
24.00,33.00,24.00,2,0,0
24.00,33.00,24.00,22,0,0
24.00,33.00,24.00,-2,0,0
24.00,33.00,24.00,-2,0,0
24.00,33.00,24.00,-2,0,0
24.00,33.05,25.11,20396,0,0
24.00,33.38,25.12,20212,0,0
24.00,33.39,25.12,20308,0,0
24.00,33.42,25.12,20296,0,0
24.00,33.43,25.11,20158,0,0
24.00,33.43,25.11,20178,0,0
24.00,33.00,24.00,-4,0,0
24.00,33.00,24.00,20,3920,1
24.00,33.00,24.00,-2,0,0
24.00,33.00,24.00,-4,0,0
24.00,33.44,25.13,20396,0,0
24.00,33.46,25.11,20156,0,0
24.00,33.46,25.12,20268,0,0
24.00,33.41,25.12,20322,0,0
24.00,33.40,25.11,20216,0,0
24.00,33.46,25.12,20168,0,0
24.00,33.00,24.00,-2,0,0
24.00,33.00,24.00,-2,0,0
24.00,33.00,24.00,-2,0,0
24.00,33.00,24.00,22,0,0

Nanos > 250000

The TurboBoost transition from 3.3 GHz to 3.4 GHz now happens reliably. As the nanos increase, the logs fill with roughly integer multiples of the 20000-cycle quantum. Eventually there are so many nanos that the Linux scheduler's interrupts become permanent fixtures, but preemption is easily detected with the performance counters, and its effect is not at all similar to the TurboBoost halts.

24.00,33.75,24.45,20166,0,0
24.00,33.78,24.45,20302,0,0
24.00,33.78,24.45,20202,0,0
24.00,33.68,24.91,41082,0,0
24.00,33.31,24.90,40998,0,0
24.00,33.70,25.30,58986,3668,1
24.00,33.74,24.42,18798,0,0
24.00,33.74,24.45,20172,0,0
24.00,33.77,24.45,20156,0,0
24.00,33.78,24.45,20258,0,0
24.00,33.78,24.45,20240,0,0
24.00,33.77,24.42,18826,0,0
24.00,33.75,24.45,20372,0,0
24.00,33.76,24.42,18798,4081,1
24.00,33.74,24.41,18460,0,0
24.00,33.75,24.45,20234,0,0
24.00,33.77,24.45,20284,0,0
24.00,33.78,24.45,20150,0,0
24.00,33.78,24.45,20314,0,0
24.00,33.78,24.42,18766,0,0
24.00,33.71,25.36,61608,0,0
24.00,33.76,24.45,20336,0,0
24.00,33.78,24.45,20234,0,0
24.00,33.78,24.45,20210,0,0
24.00,33.78,24.45,20210,0,0
24.00,33.00,24.00,-10,0,0
24.00,33.00,24.00,4,0,0
24.00,33.00,24.00,18,0,0
24.00,33.00,24.00,2,4132,1
24.00,33.00,24.00,44,0,0

Conclusions

The TurboBoost machinery is responsible for the RDTSC-REFTSC discrepancy. This discrepancy can be used to determine that a TurboBoost state transition from 3.3 GHz to 3.4 GHz takes approximately 20500 reference clock cycles (8.5us), and is triggered no later than about 250000 ns (250us; 600000 reference clock cycles) after entry into add_calibration(), when the processor decides that the workload is intense enough to deserve frequency-voltage scaling.
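
As a quick sanity check on those unit conversions (20500 and 600000 reference clock cycles at 2.4e9 cycles per second):

#include <stdio.h>

int main(void) {
    const double ref_hz = 2.4e9;  // 24 x 100 MHz reference clock
    printf("halt per transition: %.2f us\n", 20500.0  / ref_hz * 1e6);  // ~8.5 us
    printf("decision latency:    %.2f us\n", 600000.0 / ref_hz * 1e6);  // 250 us
    return 0;
}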

Future Work

More research needs to be done to determine how the transition cost varies with frequency, and whether the hardware that selects the power state can be tuned. Of particular interest to me are the "Turbo Attenuation Units", hints of which I've seen in the far reaches of the web. Perhaps the Turbo hardware has a configurable time window? Currently the ratio of time spent deciding to time spent transitioning is 30:1 (600k:20k reference cycles, roughly 250us:8.5us). Can it be tuned?

Iwillnotexist Idonotexist
  • Did the presence/non-presence of missing TSC_REF cycles in your last (CSV output) experiments correlate with the setting of the `Turbo Transition Attenuation` bit? I guess one question is how to actually read that bit. Presumably you can't read the "current state" in a useful way since when the CPU is halted for a transition you can't read (unlike say non-halt methods of reducing perf). So maybe you are supposed to clear the sticky bit and read it after? Awesome results! – BeeOnRope Aug 09 '17 at 15:19
  • @BeeOnRope Actually I did read that bit, and the hardware did set it more often at higher thread counts. It's theoretically possible for unhalted code to see it set (and my code did see it set) because, if active, it means the processor is refusing to scale up: in the near past it scaled down, and the hysteresis timer hasn't expired yet. – Iwillnotexist Idonotexist Aug 09 '17 at 18:51
  • Ah, I see - so it's "because if active it means the processor is refusing to scale up because in the near past it scaled down and the hysteresis timer didn't expire yet", which is different than I thought. Where did you learn about that? – BeeOnRope Aug 09 '17 at 18:58
  • 3
    @BeeOnRope It's from the doc of the CORE_PERF_LIMIT_REASONS MSR, _"Bit 13: Turbo Transition Attenuation Status. When set, frequency is reduced below the operating system request due to Turbo transition attenuation. This prevents performance degradation due to frequent operating ratio changes."_ For me it means the hardware is below where it would be given all other envelope conditions, but the hysteresis timer detected too many transitions in the recent past and is rejecting an upscaling now while we're in a lower P-state; It acts as a sort of oscillation damper. – Iwillnotexist Idonotexist Aug 09 '17 at 19:10
  • Cool, you had written about it above but I misread it. Really interesting. – BeeOnRope Aug 09 '17 at 19:21
  • @BeeOnRope I just thought of a way to measure the transition time with surgical precision - poll rdtsc and reftsc continuously and simultaneously and attempt to detect the "preemption" by TB hardware as an apparent leap of 20k cycles of rdtsc relative to reftsc. The question is how to ensure the "preemption" doesn't happen between the rdtsc and reftsc reads. – Iwillnotexist Idonotexist Aug 09 '17 at 19:28
  • 1
    Yeah that should work. It doesn't seem to matter too much to me when the preemption happens exactly: if you are polling `TSC_REF` (A) and `rdtsc` (B) back and forth like `ABABA`, it doesn't matter much whether the preemption (`x`) happens like `ABxABA` or `ABAxBA`, since in either case you'll see a large `B -> B` gap, while all of the `A -> A` gaps should look about normal; or if not normal, it would be one of the two `A -> A` gaps, and you can check both. In a way, you don't even need `ABABA` but just `BBBB` (i.e., just poll `rdtsc`). – BeeOnRope Aug 09 '17 at 20:34
  • That will catch all cases where the process is halted for any reason, but you can always rule out the other ones, e.g., by checking the interrupt count. – BeeOnRope Aug 09 '17 at 20:48
  • I didn't quite understand "True that! It means that, if lucky, 'preemption' could occur between consecutive rdtsc's." If you are issuing nothing but a stream of `rdtsc`, preemption will _always_ occur between some pair of `rdtsc`s, right? – BeeOnRope Aug 09 '17 at 20:49
  • Let's continue this in chat since moderators seem to hate long comment threads. – BeeOnRope Aug 09 '17 at 21:06
  • 1
    Hardware duty-cycling is only used to hit TDP levels below what they can do with the max-efficiency slowest clock speed. Probably even regular ULV laptop chips (ix-6xxxU) don't support it, but maybe the Core-M CPUs that go down to 3.5W TDP-down would. See http://myeventagenda.com/sessions/0B9F4191-1C29-408A-8B61-65D7520025A8/7/5#sessionID=155 (IDF2015 talk audio + slides from Efraim Rotem, the lead client power architect for Skylake.) 16 cycles is very short, maybe it's not that duty-cycling. @BeeOnRope: there's some maybe-relevant stuff about SKL's other freq-switching decisions in there. – Peter Cordes Aug 10 '17 at 04:37