CPU Throttling no longer works at Xcode in Release mode

Question

I have an Xcode C++ project that is performance-sensitive, and I am using a trick with CPU throttling. Essentially, I add this code:

// function to occupy a thread for an infinite amount of time
void coreEngager() {
    while (true) {}
}

// call it in the background thread
std::thread t1(coreEngager);

// call it once the work is done
t1.detach();

This little trick boosts calculation speed by about 50%, which is very significant in my case. But I recently discovered a problem: this code crashes when I run the project in Release mode. The crash happens in the coreEngager function. I didn't have that issue before, and I do not have it now on Linux or Windows. Can you please advise what has changed, or how to make CPU throttling work on macOS?

In C++ a thread is guaranteed to [make progress](https://eel.is/c++draft/intro.progress). `while (true) {}` does not make progress and the compiler is allowed to optimize that away. — Thomas Weller, Jun 16 '23 at 16:26
"what has changed" - to answer that, you look up the Release Notes of your compiler - whatever version you were using before and the version you are using now. — Thomas Weller, Jun 16 '23 at 16:30
"This little trick allows to boost calculation speed about 50%" - you expect 1 core to run a thread at 100% core consumption and you expect your calculation to be 50% faster? I wouldn't rely on that - in contrast: your CPU should get hot and throttle. You also use the term throttling. Throttling means that the CPU gets slower, not faster, right? — Thomas Weller, Jun 16 '23 at 16:35
Having an infinite empty loop in C++ is ill formed code (undefined behavior). I would have to look up the details, but this exception is there to allow the compiler to optimize code. So basically `void coreEngager()` is NOT valid C++ and it is allowed to crash — Pepijn Kramer, Jun 16 '23 at 16:42
Can you explain to us why you think this is making your code run faster? — Pepijn Kramer, Jun 16 '23 at 16:43
@ThomasWeller To be more specific it is undefined behavior : https://en.cppreference.com/w/cpp/language/ub -> infinite loops without side effects. — Pepijn Kramer, Jun 16 '23 at 16:46
i am guessing you are pushing the cpu to overclock itself instead of going idle .... i would put a volatile variable that gets checked inside the loop on every iteration so the compiler won't optimize it away ... or a better solution is fix the main code to actually use the other thread in the calculations. — Ahmed AEK, Jun 16 '23 at 16:48
If your goal is to force the CPU (cores) to not go to an idle state, lower P-state or lower frequency or something along these lines, then it would be a better idea to use your operating system to configure the CPU the way you want it to. E.g. as a baseline, there is probably some generic performance vs. powersave setting somewhere in your OS interface. — user17732522, Jun 16 '23 at 16:50
@PepijnKramer: that's new to me. Where in the standard is that mentioned? The whole page https://eel.is/c++draft/intro.progress does not contain the word "undefined". — Thomas Weller, Jun 16 '23 at 16:59
At least put `_mm_pause()` in the loop `#ifdef __x86_64__` so you're not heating up the CPU so much doing nothing. I assume you're tuning for CPUs like Skylake where memory-bound code can let the CPU downclock, e.g. down to 2.7GHz instead of 3.9GHz with hardware P-state management set to `balance_power` or `balance_performance` (unlike with it set to full `performance`) with Linux's `energy_performance_preference`. Or do you have other threads that fully sleep sometimes, and without a spinning thread they come up at idle frequency initially? — Peter Cordes, Jun 16 '23 at 17:01
@ThomasWeller: https://eel.is/c++draft/intro.progress#1 "The implementation may assume that any thread will eventually do one of the following:" is equivalent to it being UB if the program doesn't. Fun fact: ISO C has a different rule, that if the loop condition is a constant expression like `1`, infinite loops are well-defined. But IIRC, clang has a bug in that case, still applying the C++ rule. — Peter Cordes, Jun 16 '23 at 17:03
@PeterCordes I hate the C++ spec for their wording. If that's true, you can't grep it for undefined behavior at all. It's already bad that they use "undefined behavior" and "behavior is undefined" , but how on earth shall I ever find all instances of UB then? — Thomas Weller, Jun 16 '23 at 17:05
@PepijnKramer It is UB and meant to be used for optimization, but not really the kind of optimization that would remove a known-infinite loop. The issue is that you need to know whether a loop terminates for certain optimizations and that is often difficult/impossible to prove. So with that guarantee the compiler can skip trying to verify that the loop will end. And at least Clang is known to be aggressive about this producing e.g.: https://godbolt.org/z/hWK4dhzP7 — user17732522, Jun 16 '23 at 17:06
@ThomasWeller See the Clang example above. It outputs nonsense after the intended output and then segfaults because Clang considers it impossible that the function can ever be called, so it removed all instructions from its body, even the `ret`, so that arbitrary code after it in the executable is executed instead. — user17732522, Jun 16 '23 at 17:06
@user17732522: just because one compiler generates UB is not necessarily a proof that it *is* UB according to the standard. But ok, I'll believe it and update my misunderstanding. — Thomas Weller, Jun 16 '23 at 17:10
Looks like even the standards committee had some internal discussion on this : [N1509](https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1509.pdf) and [N1528](https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1528.htm). — Pepijn Kramer, Jun 16 '23 at 17:41
@ThomasWeller: One compiler treating something as undefined is actually pretty strong evidence that it's allowed to do so. Not proof of course, but compiler bugs are rare, especially in code this simple; compilers breaking code that's actually valid is rare. The reverse is very common, though: UB code that happens to work on at least one compiler, especially in a debug build, is not rare at all. — Peter Cordes, Jun 16 '23 at 21:13
Wow, so much attention to this question. Before you get all judgemental, this code is not mine, it's part of the legacy, and I am not in favour of this decision. Thank you for your valid comments, I appreciate that and these all have something valuable for them @PepijnKramer because I know :) I was surprised to see such a practice, so I measured time with and without it. Yes, it is faster and it's faster significantly, as I said. — Eugene Alexeev, Jun 18 '23 at 09:51
As a matter of fact, I got so curious about it that I implemented a very similar practice on iOS just to see if it works. On iOS you can do it much more safely, because you have a convenient API that gives you CPU information, and you can enable/disable CPU engagement based on the current load. And surprise surprise: I got a 30-35% gain in processing speed. I am not advising you to do that in your performance-sensitive iOS project; it was an experiment and was never published to the App Store — Eugene Alexeev, Jun 18 '23 at 09:54

Peter Cordes · Answer 1 · 2023-06-16T22:20:03.660

I've used a pause loop to keep the CPU frequency up while benchmarking sometimes, but I'd be reluctant to waste a whole core on an infinite loop for production use. Instead, tune your CPU power-management settings, or recommend that your users do so in your install guide. Or have your program (or a separate script) check the settings and recommend changes.

In C++, infinite loops without I/O, volatile accesses, or atomic operations are undefined behaviour. The compiler can remove them (https://eel.is/c++draft/intro.progress#1), and clang does that aggressively. It emits no asm instructions for your coreEngager() function, so execution just falls into whatever function is next in the binary.

This C++ rule was intended to help compilers when the loop condition is non-trivial and they might have trouble proving that a loop terminates.

Fun fact: ISO C has a different rule, that if the loop condition is a constant expression like 1, infinite loops are well-defined. But IIRC, clang has a bug in that case, still applying the C++ rule.

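To actually fix this, put an empty asm volatile("") statement inside the loop to keep the compiler happy regardless of target, without actually running extra instructions. (In GNU C, an asm statement with no output operands is implicitly volatile, so a plain asm("") has the same effect.)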

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define SPIN() _mm_pause()  // heat up the CPU less
#else
#define SPIN()  /**/
#endif

void coreEngager() {
    while (true) {
        asm("");       // counts as a volatile op for infinite-loop UB.
        SPIN();        // x86 pause or whatever
    }
}

Godbolt shows that it compiles as desired:

# clang16 -O3 for x86-64
coreEngager_buggy():         # zero asm instructions for your orig source!
coreEngager_fixed():
.LBB1_1:
        pause
        jmp     .LBB1_1

# clang16 -O3 for AArch64
coreEngager_buggy():
coreEngager_fixed():
.LBB1_1:
        b       .LBB1_1
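
Since the progress rule also exempts atomic operations, another option is to poll a `std::atomic<bool>` stop flag. That keeps the loop well-defined and also lets the thread be joined instead of detached while it is still spinning. A minimal sketch, assuming C++11 or later; the `CoreEngager` class and its member names are hypothetical, not part of the original code:

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical RAII helper: spins on one core until it is destroyed.
class CoreEngager {
public:
    CoreEngager() : worker_([this] { spin(); }) {}

    ~CoreEngager() {
        // Ask the spinning thread to exit, then wait for it.
        stop_.store(true, std::memory_order_relaxed);
        worker_.join();
    }

    CoreEngager(const CoreEngager&) = delete;
    CoreEngager& operator=(const CoreEngager&) = delete;

private:
    void spin() {
        // The atomic load on every iteration counts as progress,
        // so the compiler may not assume the loop away.
        while (!stop_.load(std::memory_order_relaxed)) {
        }
    }

    // Declared before worker_ so it is initialized before the thread starts.
    std::atomic<bool> stop_{false};
    std::thread worker_;
};
```

Constructing a `CoreEngager` before the heavy computation and letting it go out of scope afterwards stops the spinning thread cleanly, at the cost of still occupying a core in the meantime.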

If there's an AArch64 equivalent to x86 pause, use that. There might not be since there haven't been SMT AArch64 CPUs (multiple logical cores sharing a physical core), and they don't have memory-order mis-speculation in spin loops because they don't have x86's strongly-ordered memory model. Except M1 in x86-compat memory model mode; maybe Apple has a pause-like instruction?
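
For what it's worth, the AArch64 instruction set does define a `YIELD` hint instruction, although whether it has any useful effect on a particular core (including Apple's) is implementation-specific and not something measured here. A sketch of a per-architecture spin hint under that assumption, using GNU-style inline asm (so GCC or Clang is assumed); `cpuRelax` is a hypothetical helper name:

```cpp
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#endif

// Hypothetical helper: emit a spin-wait hint where one is known,
// otherwise just an empty volatile asm statement.
inline void cpuRelax() {
#if defined(__x86_64__) || defined(__i386__)
    _mm_pause();             // x86 pause
#elif defined(__aarch64__)
    asm volatile("yield");   // AArch64 YIELD hint; may be a no-op on some cores
#else
    asm volatile("");        // no known hint: emits no instruction
#endif
}
```

Calling `cpuRelax()` in the loop body would take the place of `SPIN()`; keeping the empty `asm("")` next to it, as in the code above, stays the safe choice for the infinite-loop rule.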

I assume you're tuning for CPUs like Skylake, where memory-bound code can let the CPU downclock, e.g. down to 2.7GHz instead of 3.9GHz on my i7-6700k, when hardware P-state management is set to balance_power or balance_performance (rather than full performance) via Linux's energy_performance_preference. (See "Slowing down CPU Frequency by imposing memory stress".)

Or do you have other threads that fully sleep sometimes, and without a spinning thread they come up at idle frequency initially? (See "Why does this delay-loop start to run faster after several iterations with no sleep?".)
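
To illustrate the suggestion of having the program check the power-management setting and recommend a change, here is a rough Linux-only sketch. The sysfs path is an assumption: it is where kernels using the intel_pstate driver with hardware P-states typically expose `energy_performance_preference`, and it will not exist on macOS, Windows, or Linux systems with other cpufreq drivers. The function names are hypothetical.

```cpp
#include <fstream>
#include <string>

// Assumed location of the EPP setting for CPU 0 (Linux, intel_pstate with HWP).
const char* const kEppPath =
    "/sys/devices/system/cpu/cpu0/cpufreq/energy_performance_preference";

// Reads the first line of a file into `out`.
// Returns false if the file is missing or unreadable.
bool readFirstLine(const std::string& path, std::string& out) {
    std::ifstream in(path);
    if (!in) {
        return false;
    }
    return static_cast<bool>(std::getline(in, out));
}

// True when the current preference is anything other than full performance,
// i.e. when a hint to the user might be worthwhile.
bool shouldRecommendChange(const std::string& epp) {
    return epp != "performance";
}
```

A program could call `readFirstLine(kEppPath, value)` at startup and, when it succeeds and `shouldRecommendChange(value)` is true, print a note suggesting the `performance` setting instead of spinning a thread.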

Perhaps move the last paragraph up to the top. IMHO the CPU settings should be changed before implementing a software solution. For performance measurements, I even disable TurboBoost in the BIOS (or set CPU to 99% max in Windows) so that I get more consistent measurements (and reduce the fan noise, which is a nice side effect, especially on Laptops) — Thomas Weller, Jun 16 '23 at 22:06
@ThomasWeller: Thanks, good suggestion. For microbenchmarking, measuring user-space core clock cycles (e.g. `perf stat --all-user ./a.out`) often works well for the things I want to measure. It helps that my desktop can sustain its limited single-core turbo indefinitely (4.2 GHz up from 4.0 GHz non-turbo), especially if I leave EPP = balance_performance so it actually only ever clocks up to 3.9 GHz so the fans stay totally silent. Laptops where the peak power is so much higher than the sustained cooling are different. — Peter Cordes, Jun 16 '23 at 22:24
Thank you very much for sharing such extensive expertise! I will go through it next week and share my update — Eugene Alexeev, Jun 18 '23 at 09:59
