Why do MWAIT Power Management hints cause premature wakeups?

Question

For university, I'm currently experimenting with the MONITOR/MWAIT instruction pair. Specifically, I want to measure how much energy the CPU uses in different scenarios, and I have already built a test setup that works reasonably well. As part of the setup, I have all the cores enter MWAIT and then use an NMI to wake them up again after a specified time. So far everything was working fine, but now I wanted to test how the Power Management hints affect the power consumption.

Unfortunately, every hint apart from 0 seems to cause MWAIT to not wait for the NMI, but to wake up on its own after 3-4 ms. As far as I understand the documentation, the Power Management hints should not have any impact on when execution is continued after the MWAIT, so this is quite strange. And since I still haven't made any progress even after spending a few hours on this problem, I thought maybe someone here has some idea what is going on!

Here is how I use MONITOR/MWAIT in my code:

volatile int dummy;

void do_mwait() {
    asm volatile("monitor;" ::"a"(&dummy), "c"(0), "d"(0));
    asm volatile("mwait;" ::"a"(0x10), "c"(0));
}

This is obviously just a small excerpt of the Linux kernel module I've written, but should contain all the important points. dummy is a variable that is never used outside of what you can see here. It only exists so that I have a valid address to pass to monitor. do_mwait() is the function that gets executed on every core available while I do my measurements. As I said, just exchanging the 0x10 in the second line of do_mwait() with 0 makes it work the way I expect.
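
For reference, here is a minimal sketch of how the EAX hint is laid out according to Intel's MWAIT documentation: bits 7:4 select the target C-state (0 means C1, 1 means C2, and so on, with 0xF meaning C0) and bits 3:0 select the sub C-state. The helper name `mwait_hint` and the macro `MWAIT_TARGET_C0` are made up for illustration and are not part of the module above.

```c
#include <stdint.h>

/* Value of bits 7:4 that Intel documents as "C0" (hypothetical macro name). */
#define MWAIT_TARGET_C0 0xFu

/*
 * Build the EAX hint for MWAIT (hypothetical helper).
 *   cstate:   1 for C1, 2 for C2, ... in MWAIT's own numbering; 0 for C0
 *   substate: sub C-state index, 0..15
 * Bits 7:4 hold (cstate - 1); bits 3:0 hold the sub-state.
 * For cstate == 0 the unsigned subtraction wraps, so the masked
 * field becomes 0xF, which is the documented encoding for C0.
 */
static inline uint32_t mwait_hint(unsigned cstate, unsigned substate)
{
    return (((cstate - 1u) & 0xFu) << 4) | (substate & 0xFu);
}
```

With this encoding, `mwait_hint(1, 0)` is 0 (the value that works for me) and `mwait_hint(2, 0)` is 0x10 (the value that wakes up early).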

Because the behaviour and supported features of MONITOR/MWAIT depend on the specific CPU model, here are all the relevant (I think) parts of cpuid on my test machine. As far as I see, all necessary features should be supported:

CPU 0:
   vendor_id = "GenuineIntel"
   version information (1/eax):
      processor type  = primary processor (0)
      family          = 0x6 (6)
      model           = 0xc (12)
      stepping id     = 0x3 (3)
      extended family = 0x0 (0)
      extended model  = 0x3 (3)
      (family synth)  = 0x6 (6)
      (model synth)   = 0x3c (60)
      (simple synth)  = Intel Core (unknown type) (Haswell C0) {Haswell}, 22nm
   ...
   feature information (1/ecx):
      ...
      MONITOR/MWAIT                           = true
      ...
   ...
   MONITOR/MWAIT (5):
      smallest monitor-line size (bytes)       = 0x40 (64)
      largest monitor-line size (bytes)        = 0x40 (64)
      enum of Monitor-MWAIT exts supported     = true
      supports intrs as break-event for MWAIT  = true
      number of C0 sub C-states using MWAIT    = 0x0 (0)
      number of C1 sub C-states using MWAIT    = 0x2 (2)
      number of C2 sub C-states using MWAIT    = 0x1 (1)
      number of C3 sub C-states using MWAIT    = 0x2 (2)
      number of C4 sub C-states using MWAIT    = 0x4 (4)
      number of C5 sub C-states using MWAIT    = 0x0 (0)
      number of C6 sub C-states using MWAIT    = 0x0 (0)
      number of C7 sub C-states using MWAIT    = 0x0 (0)
   ...
   brand = "Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz"
   ...
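
The sub C-state counts in the dump come from CPUID leaf 5, where EDX packs one 4-bit count per C-state. As a rough sketch of how to decode it: the function names below are hypothetical, and the EDX value 0x00042120 used in the usage note is reconstructed from the counts listed above, not read directly from the machine.

```c
#include <stdint.h>
#include <stdio.h>

/* Number of MWAIT sub C-states for C-state index n (0..7):
 * CPUID.05H:EDX bits [4n+3 : 4n]. */
static unsigned mwait_substates(uint32_t edx, unsigned n)
{
    return (edx >> (4u * n)) & 0xFu;
}

/* Print the eight counts in the same order as the cpuid tool does. */
static void print_mwait_substates(uint32_t edx)
{
    for (unsigned n = 0; n < 8; n++)
        printf("C%u: %u sub C-states\n", n, mwait_substates(edx, n));
}
```

Calling `print_mwait_substates(0x00042120)` reproduces the counts above (C1 = 2, C2 = 1, C3 = 2, C4 = 4). With GCC or Clang, the live value can be obtained with `__get_cpuid(5, &eax, &ebx, &ecx, &edx)` from `<cpuid.h>`.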

I hope this is enough context. Please tell me if I need to share additional information. And thanks in advance for any input, even if it's just an (educated) guess!

I assume you've disabled other interrupts (`cli`), so only an NMI should wake the CPU? Did you disable the kernel's "NMI watchdog" that uses a PMU hardware performance counter to generate periodic NMIs, or is that the NMI you're waiting for? (`/proc/sys/kernel/nmi_watchdog` = 0.) — Peter Cordes, May 17 '23 at 20:48
Interrupts are disabled, yes. Unfortunately I don't have access to my test system right now, so I can't confirm it by looking at `/proc/sys/kernel/nmi_watchdog`, but I'm pretty sure the NMI watchdog is disabled. Otherwise it would be a problem in both cases, not just if I use the hints, right? The NMI I'm waiting for is generated by the HPET by the way. — ObiBabobi, May 17 '23 at 22:12
Yes, that's correct. I missed the detail where you said there was one hint value where you saw it sleeping the expected time. (Which is presumably a lot longer than 4 ms, so it couldn't just be luck of timing?) — Peter Cordes, May 17 '23 at 22:20
I've tested it for sleep durations of 10, 100 and 1000 ms, and that all works without issue (as long as I use a hint of 0). — ObiBabobi, May 17 '23 at 22:22
From https://www.felixcloutier.com/x86/mwait , EAX=0 requests C0 state, EAX=0x10 is C1. And ECX=0 means to wake on interrupts even if they're masked (`cli`), but all other ECX values are reserved. So if you want to not wake early, you may have to turn off interrupt sources like the timer interrupt? — Peter Cordes, May 17 '23 at 22:32
Or if it's not a masked interrupt waking the core, I wonder if deeper C states wake up when the physical core needs to power up because of the other logical core? Do you have hyperthreading enabled? If so, try turning it off in the BIOS or with a boot option. (If it is waking from C1 or deeper due to the other logical core, that might make sense for real use-cases: most of the cost of waking has already been paid, might as well give the OS a chance to see if there's anything to do. But that'd mean the core has to partition its SMT resources, not give full speed to the woken task... Maybe not) — Peter Cordes, May 17 '23 at 22:35
EAX=0 should request C1 as far as I understand the documentation, and EAX=0x10 should request C2. And ECX=0 means that it should ignore masked interrupts, no? Also keep in mind that ECX doesn't change between the case where it works and the one where it doesn't, so that can't really be the issue I think. — ObiBabobi, May 17 '23 at 22:42
Oh, yes I misread that. EAX=0 is C1, EAX=0xF0 (bits 7:4 = 1111b) is C0 (which IIRC is what `hlt` uses). And yes, good point about ECX=0 even for the case that does sleep a long time. Also, yes, ECX=0 means to ignore masked interrupts. That's bit-position 0, not value 0. Oops, that makes more sense :P With it set to 0, it should ignore masked interrupts. — Peter Cordes, May 17 '23 at 22:51
Hyperthreading is enabled, but I always enter the described `do_mwait()` function on both threads of a core simultaneously. So that shouldn't be the problem either. I can still play around with disabling hyperthreading when I next have the chance, but I suspect it is not going to change much. — ObiBabobi, May 17 '23 at 22:53
Have you checked for Haswell errata? https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/4th-gen-core-family-desktop-specification-update.pdf from 2020 mentions `mwait` in two places, but none that would explain this. HSD60 mentions that it won't enter C6 or deeper when PCIe links are disabled. Perhaps the microcode workaround/fix for some deeper sleep states causing hangs or weirdness involves limiting their sleep duration? I searched for "C-state" in the PDF and found various things that just say "the BIOS can contain a workaround", implying a microcode update. — Peter Cordes, May 17 '23 at 22:56
So no obvious errata that I saw, but it's possible this might be a real effect on your hardware, not a problem with your code. — Peter Cordes, May 17 '23 at 22:57
I may soon get the chance to test my code on a newer CPU. If it works there, it really might just be the actual behaviour deviating from what is described in the documentation and what you would expect. I will report back then (though it might take a few weeks). — ObiBabobi, May 17 '23 at 23:07
If you edit your question with a link to your test code, someone else with a different CPU might be curious and want to run the experiment themselves. — Peter Cordes, May 17 '23 at 23:09
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/253708/discussion-between-obibabobi-and-peter-cordes). — ObiBabobi, May 17 '23 at 23:21
From Intel's manual (https://www.felixcloutier.com/x86/mwait): ***Implementation-specific conditions** may result in an interrupt causing the processor to exit the implementation-dependent-optimized state even if interrupts are masked and ECX[0] = 0.* This might be an example of that. — Peter Cordes, May 18 '23 at 00:18

Peter Cordes · Accepted Answer · 2023-05-18T00:39:30.460

From Intel's manual (https://www.felixcloutier.com/x86/mwait):

The following cause the processor to exit the implementation-dependent-optimized state: a store to the address range armed by the MONITOR instruction, an NMI or SMI, a debug exception, a machine check exception, the BINIT# signal, the INIT# signal, and the RESET# signal. Other implementation-dependent events may also cause the processor to exit the implementation-dependent-optimized state.

...

Implementation-specific conditions may result in an interrupt causing the processor to exit the implementation-dependent-optimized state even if interrupts are masked and ECX[0] = 0.

This might be an example of either of those.

Obviously that's not very satisfying, and it would be nice if I knew where to look to find any microarchitectural explanation of why that might only happen with deeper sleep states.

Like perhaps in C1 state (EAX=0), enough is still powered on to check the interrupt mask, but in deeper sleep states even some of the internal interrupt controller stuff is powered down?

You'd hope that they could check without powering up the whole core and resuming execution, but maybe that's something a microcode update enabled as a workaround for some design problem that was discovered later. Intel's Haswell errata list (https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/4th-gen-core-family-desktop-specification-update.pdf) does mention some C-state related problems where "the BIOS can contain a workaround", which actually means the BIOS can include a microcode update that changes CPU behaviour. Intel often includes ways for microcode to disable certain optimizations or features in case problems are found, and perhaps the only way to fix some lockup in an odd corner case was to always have the core wake from C2 or deeper on interrupts, even when they're supposed to be masked.

That's pure guesswork as to the cause, but Intel does clearly document that what you're seeing is possible.
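
Since early wakeups are allowed by the documentation, one common way to cope is to treat MWAIT like any wait that can return spuriously: re-check the real condition and go back to sleep if it hasn't happened yet. Below is a minimal sketch of that loop, not your actual module code. `wait_once` is a placeholder for "MONITOR + MWAIT with the chosen hint", `done` is a hypothetical flag your NMI handler would set, and the `demo_*` names are stand-ins that only exist so the loop can be exercised in user space.

```c
/*
 * Re-enter the wait primitive until *done becomes nonzero.
 * Returns how many times the wait primitive returned while the
 * flag was still clear, i.e. the number of early wakeups.
 */
static unsigned wait_until_done(volatile int *done, void (*wait_once)(void))
{
    unsigned early = 0;

    while (!*done) {
        wait_once();
        if (!*done)
            early++;   /* woke up, but not for the event we wanted */
    }
    return early;
}

/* User-space stand-in for the MWAIT path: pretends the first three
 * wakeups are premature and the fourth one is the real event. */
static volatile int demo_done;
static int demo_calls;

static void demo_wait_once(void)
{
    if (++demo_calls == 4)
        demo_done = 1;
}
```

In the demo, `wait_until_done(&demo_done, demo_wait_once)` reports three early wakeups. In a real measurement, the returned count would also tell you how often the core left the deeper state, which matters because each wakeup costs energy and so affects what you are measuring.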

Other possibilities include SMI (system-management-mode interrupts), but hopefully your system doesn't fire those regularly.
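
To rule SMIs in or out, Intel CPUs of this era expose an SMI counter in `MSR_SMI_COUNT` (MSR 0x34); sampling it before and after a test run shows whether any SMIs arrived. The following is a user-space sketch that assumes the Linux `msr` driver is loaded and the program runs as root. The helper names `read_msr` and `smi_delta` are made up for illustration; inside a kernel module you would read the MSR directly instead.

```c
#define _XOPEN_SOURCE 700   /* for pread() */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MSR_SMI_COUNT 0x34  /* Intel: number of SMIs since reset */

/* Read one MSR through /dev/cpu/<cpu>/msr.
 * Returns 0 on success, -1 on failure (no msr driver, no permission, ...). */
static int read_msr(int cpu, uint32_t msr, uint64_t *value)
{
    char path[64];
    snprintf(path, sizeof path, "/dev/cpu/%d/msr", cpu);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* The msr driver uses the file offset as the MSR number. */
    ssize_t n = pread(fd, value, sizeof *value, msr);
    close(fd);
    return n == (ssize_t)sizeof *value ? 0 : -1;
}

/* SMIs seen between two samples; the count lives in the low 32 bits. */
static uint32_t smi_delta(uint64_t before, uint64_t after)
{
    return (uint32_t)after - (uint32_t)before;
}
```

If `smi_delta` stays at zero across a run in which MWAIT still returns every 3-4 ms, SMIs are not the cause.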
