I've used a pause loop to keep the CPU frequency up while benchmarking sometimes, but I'd be reluctant to waste a whole core on an infinite loop for production use. Instead, tune your CPU power-management settings, or recommend that your users do so in your install guide. Or have your program (or a separate script) check the settings and recommend changes.
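A minimal sketch of such a check, assuming Linux with a driver that exposes energy_performance_preference in sysfs (intel_pstate with hardware P-states does). The helper names read_first_line and epp_looks_ok are hypothetical, and checking only cpu0 is a simplification:

```cpp
#include <fstream>
#include <iostream>
#include <optional>
#include <string>

// Read the first line of a small text file such as a sysfs attribute.
// Returns std::nullopt if the file can't be opened or is empty.
std::optional<std::string> read_first_line(const std::string& path) {
    std::ifstream in(path);
    std::string line;
    if (!in || !std::getline(in, line)) return std::nullopt;
    return line;
}

// Hypothetical helper: returns false (and prints a recommendation) when the
// energy/performance preference is readable and is not "performance".
// The default path assumes Linux and only looks at cpu0.
bool epp_looks_ok(const std::string& path =
        "/sys/devices/system/cpu/cpu0/cpufreq/energy_performance_preference") {
    const auto epp = read_first_line(path);
    if (!epp) return true;  // attribute not available: nothing to recommend
    if (*epp == "performance") return true;
    std::cerr << "energy_performance_preference is '" << *epp
              << "'; consider setting it to 'performance' for lower latency\n";
    return false;
}
```

Calling epp_looks_ok() once at startup is enough to print the hint. Changing the setting itself normally needs root, so it's better left to the user or an install script.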
In C++, infinite loops without I/O, volatile accesses, or atomic operations are undefined behaviour. The compiler can remove them (https://eel.is/c++draft/intro.progress#1), and clang does that aggressively. It emits no asm instructions for your coreEngager() function, so execution just falls into whatever function is next in the binary.
This C++ rule was intended to help compilers when the loop condition is non-trivial so they might have trouble proving a loop does terminate.
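As an illustration of the kind of loop that rule targets, here is a sketch (the function names are made up for this example) where termination is hard to prove but the loop has no side effects. When the result is unused, the forward-progress assumption lets an optimizer drop the loop without proving anything about it:

```cpp
#include <cstdint>

// Counts Collatz steps until n reaches 1. Precondition: n >= 1.
// Termination for every n is an open problem, so a compiler can't prove it.
// (n == 0 would loop forever with no side effects, which is UB in C++.)
std::uint64_t collatz_steps(std::uint64_t n) {
    std::uint64_t steps = 0;
    while (n != 1) {
        n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
        ++steps;
    }
    return steps;
}

// The result is discarded and the loop has no I/O, volatile, or atomic ops,
// so the compiler may assume the loop terminates and delete it entirely.
void discard_result(std::uint64_t n) {
    (void)collatz_steps(n);
}
```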
Fun fact: ISO C has a different rule, that if the loop condition is a constant expression like 1, infinite loops are well-defined. But IIRC, clang has a bug in that case, still applying the C++ rule.
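To actually fix this, put an empty asm volatile("") statement inside the loop to keep the compiler happy regardless of target, without actually running extra instructions. (A GNU C asm statement with no output operands is implicitly volatile, so plain asm("") works too.)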
    #if defined(__x86_64__) || defined(__i386__)
    #include <immintrin.h>
    #define SPIN() _mm_pause()  // heat up the CPU less
    #else
    #define SPIN() /**/
    #endif

    void coreEngager() {
        while (true) {
            asm("");  // counts as a volatile op for infinite-loop UB.
            SPIN();   // x86 pause or whatever
        }
    }
Godbolt shows that it compiles as desired:
    # clang16 -O3 for x86-64
    coreEngager_buggy():        # zero asm instructions for your orig source!
    coreEngager_fixed():
    .LBB1_1:
            pause
            jmp     .LBB1_1

    # clang16 -O3 for AArch64
    coreEngager_buggy():
    coreEngager_fixed():
    .LBB1_1:
            b       .LBB1_1
If there's an AArch64 equivalent to x86 pause, use that. There might not be a useful one, since SMT AArch64 CPUs (multiple logical cores sharing a physical core) are rare, and they don't have memory-order mis-speculation in spin loops because they don't have x86's strongly-ordered memory model. Except M1 in x86-compat memory model mode; maybe Apple has a pause-like instruction?
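AArch64 does define a yield hint instruction, though many cores treat it as a NOP, so whether it helps is an assumption to measure rather than a given. A sketch of a per-architecture spin hint, assuming GNU-style inline asm (GCC or clang), combined with an atomic stop flag. Atomic operations also count for the forward-progress rule mentioned above, so this loop is well-defined without the empty asm statement. The names spin_hint, keep_spinning and coreEngagerStoppable are hypothetical:

```cpp
#include <atomic>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
static inline void spin_hint() { _mm_pause(); }            // x86 pause
#elif defined(__aarch64__)
static inline void spin_hint() { asm volatile("yield"); }  // hint; may be a NOP
#else
static inline void spin_hint() { asm volatile(""); }       // emits no instruction
#endif

// Hypothetical stop flag, not part of the original code.
std::atomic<bool> keep_spinning{true};

void coreEngagerStoppable() {
    // The atomic load on every iteration keeps the loop well-defined
    // and gives another thread a way to end it.
    while (keep_spinning.load(std::memory_order_relaxed)) {
        spin_hint();
    }
}
```

Run coreEngagerStoppable on its own std::thread, then store false to keep_spinning before joining it at shutdown.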
I assume you're tuning for CPUs like Skylake where memory-bound code can let the CPU downclock, e.g. down to 2.7GHz instead of 3.9GHz on my i7-6700k when Linux's energy_performance_preference for hardware P-state management is set to balance_power or balance_performance rather than full performance. (See: Slowing down CPU Frequency by imposing memory stress)
Or do you have other threads that fully sleep sometimes, and without a spinning thread they come up at idle frequency initially? (Why does this delay-loop start to run faster after several iterations with no sleep?)
Related: