> I'm guessing this is due to the other often-quoted factoid that using PAUSE somehow signals to the processor that it's in the midst of a spinlock.
Yes, `pause` lets the CPU avoid memory-order mis-speculation when leaving a read-only spin-wait loop, which is how you should spin to avoid creating contention for the thread trying to unlock that location. (Don't spam `xchg`.)
If you have to wait for another core to change something in memory, it doesn't help much to check much more frequently than the inter-core latency, especially if you'll stall right when you finally do see the change you've been waiting for.
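For concreteness, here's a minimal sketch of that read-only spin pattern in C11, using `_mm_pause()` from `<immintrin.h>`; the lock-word encoding and function names are just illustrative, not a prescribed API:

```c
#include <immintrin.h>    // _mm_pause
#include <stdatomic.h>

// 0 = unlocked, 1 = locked.  (Layout is just this sketch's convention.)
static void spin_lock(atomic_int *lock)
{
    for (;;) {
        // Only attempt to take the lock when it looks free.
        if (atomic_exchange_explicit(lock, 1, memory_order_acquire) == 0)
            return;
        // Read-only wait: plain loads with one pause per check, so we
        // don't hammer the cache line the unlocking thread needs to write.
        while (atomic_load_explicit(lock, memory_order_relaxed) != 0)
            _mm_pause();
    }
}

static void spin_unlock(atomic_int *lock)
{
    atomic_store_explicit(lock, 0, memory_order_release);
}
```

The inner loop only loads, so the line can stay in Shared state until the unlocker writes it, instead of being bounced around by repeated `xchg` attempts.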
**Dealing with `pause` being slow (newer Intel) or fast (AMD and old Intel)**
If you have code that uses multiple `pause` instructions between checks of the shared value, you should change the code to do fewer.
See also *Why have AMD-CPUs such a silly PAUSE-timing*, with Brendan's suggestion of checking `rdtsc` in spin loops to better adapt to unknown delays from `pause` on different CPUs.
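A rough sketch of that idea (the wrapper name and deadline handling are assumptions of this sketch, not taken from the linked answer): compare `rdtsc` readings against a deadline instead of counting `pause` iterations, so the length of the wait doesn't depend on how long `pause` takes on a given CPU:

```c
#include <immintrin.h>   // _mm_pause
#include <stdatomic.h>
#include <x86intrin.h>   // __rdtsc

// Spin until *flag becomes nonzero or roughly max_tsc_ticks reference
// cycles have elapsed, regardless of this CPU's pause latency.
// Returns 1 if the flag changed, 0 on timeout.
static int spin_until_set(atomic_int *flag, unsigned long long max_tsc_ticks)
{
    unsigned long long deadline = __rdtsc() + max_tsc_ticks;
    while (atomic_load_explicit(flag, memory_order_acquire) == 0) {
        if (__rdtsc() >= deadline)
            return 0;        // give up; the caller can fall back to sleeping
        _mm_pause();
    }
    return 1;
}
```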
Basically, try to make your workload not as sensitive to `pause` latency. That may also mean trying to avoid waiting for data from other threads as much, or having other useful work to do until a lock becomes available.
**Alternatives: IDK if this is lower wakeup latency than spinning with `pause`**
On CPUs new enough to have the WAITPKG extension (Tremont, or Alder Lake / Sapphire Rapids), `umonitor` / `umwait` could be viable to wait in user-space for a memory location to change: like spinning, but the CPU wakes itself up when it sees cache-coherency traffic about a change, or something like that. Although that may be slower than `pause` if it has to enter a sleep state.
(You can ask `umwait` to only go into C0.1, not C0.2, with bit 0 of the register you specify, and EDX:EAX as a TSC deadline. Intel says a C0.2 sleep state improves performance of the other hyperthread, so presumably that means switching back to single-core-active mode, de-partitioning the store buffer, ROB, etc., and having to wait for re-partitioning before this core can wake up. But the C0.1 state doesn't do that.)
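A minimal sketch of how that could look, assuming GCC/Clang with `-mwaitpkg` and the `_umonitor` / `_umwait` intrinsics from `<immintrin.h>`; the wrapper name and the re-check logic are illustrative, not a prescribed API:

```c
#include <immintrin.h>   // _umonitor, _umwait  (requires -mwaitpkg)
#include <stdatomic.h>

// Wait until *flag stops being equal to `expected`, or until the TSC
// deadline passes.  (Sketch only; error/timeout handling omitted.)
static void wait_for_change(atomic_int *flag, int expected,
                            unsigned long long tsc_deadline)
{
    while (atomic_load_explicit(flag, memory_order_acquire) == expected) {
        _umonitor(flag);                   // arm monitoring of this cache line
        // Re-check after arming, so a store that landed just before the
        // umonitor doesn't leave us sleeping until the deadline.
        if (atomic_load_explicit(flag, memory_order_acquire) != expected)
            return;
        _umwait(1, tsc_deadline);          // bit 0 set: C0.1 only, not C0.2
    }
}
```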
Even in the worst case, `pause` is only about 140 core clock cycles. That's still much faster than a Linux system call on a modern x86-64, especially with Spectre / Meltdown mitigation. (Thousands to tens of thousands of clock cycles, up from a couple hundred just for `syscall` + `sysret`, let alone calling `schedule()` and maybe running something else.)
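If you want to see what `pause` costs on your own CPU rather than trusting quoted numbers, a crude timing loop like this is enough to tell slow-`pause` CPUs apart from fast ones; note that `rdtsc` counts reference cycles, not core clock cycles, so treat the result as a ballpark figure:

```c
#include <immintrin.h>   // _mm_pause
#include <stdio.h>
#include <x86intrin.h>   // __rdtsc

int main(void)
{
    enum { ITERS = 1000000 };
    unsigned long long start = __rdtsc();
    for (int i = 0; i < ITERS; i++)
        _mm_pause();
    unsigned long long elapsed = __rdtsc() - start;
    // Reference cycles per pause; core clocks will differ if the CPU
    // isn't running at its TSC reference frequency.
    printf("~%.1f TSC ticks per pause\n", (double)elapsed / ITERS);
    return 0;
}
```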
So if you're aiming to minimize wakeup latency at the expense of wasting CPU time spinning longer, `nanosleep` is not an option. It might be good for other use-cases though, as a fallback after spinning on `pause` a couple of times.
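A sketch of that spin-then-sleep fallback (the spin count and the 1 µs sleep are arbitrary tuning knobs for illustration, not recommendations):

```c
#define _POSIX_C_SOURCE 199309L
#include <immintrin.h>   // _mm_pause
#include <stdatomic.h>
#include <time.h>        // nanosleep

// Wait for *flag to become nonzero: spin briefly with pause for low
// wakeup latency, then back off to nanosleep so a long wait doesn't
// keep burning a core.
static void wait_flag(atomic_int *flag)
{
    for (int i = 0; i < 64; i++) {           // short spin phase
        if (atomic_load_explicit(flag, memory_order_acquire))
            return;
        _mm_pause();
    }
    const struct timespec ts = { .tv_sec = 0, .tv_nsec = 1000 };  // ~1 us
    while (!atomic_load_explicit(flag, memory_order_acquire))
        nanosleep(&ts, NULL);                // let other tasks run
}
```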
Or use `futex` to sleep on a value changing, or on a notify from another process. (It doesn't guarantee that the kernel will use `monitor` / `mwait` to sleep until a change; it will let other tasks run. So you do need to have the unlocking thread make a `futex` system call if any waiters had gone to sleep with `futex`. But you can still make a lightweight mutex that avoids any system calls in the locker and unlocker if there isn't contention and no threads go to sleep.)
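Here's a minimal sketch of such a lightweight futex-based lock for Linux. The three-state encoding (0 = unlocked, 1 = locked, 2 = locked with possible waiters) is a classic scheme assumed by this sketch, not something mandated above; the uncontended lock and unlock paths make no system calls:

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static long futex(atomic_int *uaddr, int op, int val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

// 0 = unlocked, 1 = locked, 2 = locked with (possible) waiters.
static void lock(atomic_int *m)
{
    int c = 0;
    if (atomic_compare_exchange_strong(m, &c, 1))
        return;                             // fast path: no syscall
    if (c != 2)
        c = atomic_exchange(m, 2);          // mark as contended
    while (c != 0) {
        futex(m, FUTEX_WAIT_PRIVATE, 2);    // sleep only while *m == 2
        c = atomic_exchange(m, 2);          // retake as "contended"
    }
}

static void unlock(atomic_int *m)
{
    if (atomic_exchange(m, 0) == 2)         // someone may be sleeping
        futex(m, FUTEX_WAKE_PRIVATE, 1);    // wake one waiter (syscall)
}
```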
But at that point, with `futex` you're probably reproducing what glibc's `pthread_mutex` does.