> I'm guessing this is due to the other often-quoted factoid that using PAUSE somehow signals to the processor that it's in the midst of a spinlock.
Yes, `pause` lets the CPU avoid memory-order mis-speculation when leaving a read-only spin-wait loop, which is how you should spin to avoid creating contention for the thread trying to unlock that location. (Don't spam `xchg`.)
If you have to wait for another core to change something in memory, it doesn't help much to check much more frequently than the inter-core latency, especially if you'll stall right when you finally do see the change you've been waiting for.
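For concreteness, here's a minimal sketch of that read-only spin pattern in C11, using `_mm_pause()` from `<immintrin.h>`; the lock-word encoding and function names are just illustrative, not a prescribed API:

```c
#include <immintrin.h>    // _mm_pause
#include <stdatomic.h>

// 0 = unlocked, 1 = locked.  (Layout is just this sketch's convention.)
static void spin_lock(atomic_int *lock)
{
    for (;;) {
        // Only attempt to take the lock when it looks free.
        if (atomic_exchange_explicit(lock, 1, memory_order_acquire) == 0)
            return;
        // Read-only wait: plain loads with one pause per check, so we
        // don't hammer the cache line the unlocking thread needs to write.
        while (atomic_load_explicit(lock, memory_order_relaxed) != 0)
            _mm_pause();
    }
}

static void spin_unlock(atomic_int *lock)
{
    atomic_store_explicit(lock, 0, memory_order_release);
}
```

The inner loop only loads, so the line can stay in Shared state until the unlocker writes it, instead of being bounced around by repeated `xchg` attempts.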
**Dealing with `pause` being slow (newer Intel) or fast (AMD and old Intel)**
If you have code that uses multiple `pause` instructions between checks of the shared value, you should change the code to do fewer.
See also *Why have AMD-CPUs such a silly PAUSE-timing*, with Brendan's suggestion of checking `rdtsc` in spin loops to better adapt to unknown delays from `pause` on different CPUs.
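A rough sketch of that idea (the wrapper name and deadline handling are assumptions of this sketch, not taken from the linked answer): compare `rdtsc` readings against a deadline instead of counting `pause` iterations, so the length of the wait doesn't depend on how long `pause` takes on a given CPU:

```c
#include <immintrin.h>   // _mm_pause
#include <stdatomic.h>
#include <x86intrin.h>   // __rdtsc

// Spin until *flag becomes nonzero or roughly max_tsc_ticks reference
// cycles have elapsed, regardless of this CPU's pause latency.
// Returns 1 if the flag changed, 0 on timeout.
static int spin_until_set(atomic_int *flag, unsigned long long max_tsc_ticks)
{
    unsigned long long deadline = __rdtsc() + max_tsc_ticks;
    while (atomic_load_explicit(flag, memory_order_acquire) == 0) {
        if (__rdtsc() >= deadline)
            return 0;        // give up; the caller can fall back to sleeping
        _mm_pause();
    }
    return 1;
}
```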
Basically, try to make your workload not as sensitive to `pause` latency. That may also mean trying to avoid waiting for data from other threads as much, or having other useful work to do until a lock becomes available.
**Alternatives: IDK if this is lower wakeup latency than spinning with `pause`**
On CPUs new enough to have the WAITPKG extension (Tremont, or Alder Lake / Sapphire Rapids), `umonitor` / `umwait` could be viable to wait in user-space for a memory location to change: like spinning, but the CPU wakes itself up when it sees cache-coherency traffic about a change, or something like that. Although that may be slower than `pause` if it has to enter a sleep state.
(You can ask `umwait` to only go into C0.1, not C0.2, with bit 0 of the register you specify, and EDX:EAX as a TSC deadline. Intel says a C0.2 sleep state improves performance of the other hyperthread, so presumably that means switching back to single-core-active mode, de-partitioning the store buffer, ROB, etc., and having to wait for re-partitioning before this core can wake up. But the C0.1 state doesn't do that.)
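A minimal sketch of how that could look, assuming GCC/Clang with `-mwaitpkg` and the `_umonitor` / `_umwait` intrinsics from `<immintrin.h>`; the wrapper name and the re-check logic are illustrative, not a prescribed API:

```c
#include <immintrin.h>   // _umonitor, _umwait  (requires -mwaitpkg)
#include <stdatomic.h>

// Wait until *flag stops being equal to `expected`, or until the TSC
// deadline passes.  (Sketch only; error/timeout handling omitted.)
static void wait_for_change(atomic_int *flag, int expected,
                            unsigned long long tsc_deadline)
{
    while (atomic_load_explicit(flag, memory_order_acquire) == expected) {
        _umonitor(flag);                   // arm monitoring of this cache line
        // Re-check after arming, so a store that landed just before the
        // umonitor doesn't leave us sleeping until the deadline.
        if (atomic_load_explicit(flag, memory_order_acquire) != expected)
            return;
        _umwait(1, tsc_deadline);          // bit 0 set: C0.1 only, not C0.2
    }
}
```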
Even in the worst case, `pause` is only about 140 core clock cycles. That's still much faster than a Linux system call on a modern x86-64, especially with Spectre / Meltdown mitigation. (Thousands to tens of thousands of clock cycles, up from a couple hundred just for `syscall` + `sysret`, let alone calling `schedule()` and maybe running something else.)
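If you want to see what `pause` costs on your own CPU rather than trusting quoted numbers, a crude timing loop like this is enough to tell slow-`pause` CPUs apart from fast ones; note that `rdtsc` counts reference cycles, not core clock cycles, so treat the result as a ballpark figure:

```c
#include <immintrin.h>   // _mm_pause
#include <stdio.h>
#include <x86intrin.h>   // __rdtsc

int main(void)
{
    enum { ITERS = 1000000 };
    unsigned long long start = __rdtsc();
    for (int i = 0; i < ITERS; i++)
        _mm_pause();
    unsigned long long elapsed = __rdtsc() - start;
    // Reference cycles per pause; core clocks will differ if the CPU
    // isn't running at its TSC reference frequency.
    printf("~%.1f TSC ticks per pause\n", (double)elapsed / ITERS);
    return 0;
}
```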
So if you're aiming to minimize wakeup latency at the expense of wasting CPU time spinning longer, `nanosleep` is not an option. It might be good for other use-cases though, as a fallback after spinning on `pause` a couple of times.
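A sketch of that spin-then-sleep fallback (the spin count and the 1 µs sleep are arbitrary tuning knobs for illustration, not recommendations):

```c
#define _POSIX_C_SOURCE 199309L
#include <immintrin.h>   // _mm_pause
#include <stdatomic.h>
#include <time.h>        // nanosleep

// Wait for *flag to become nonzero: spin briefly with pause for low
// wakeup latency, then back off to nanosleep so a long wait doesn't
// keep burning a core.
static void wait_flag(atomic_int *flag)
{
    for (int i = 0; i < 64; i++) {           // short spin phase
        if (atomic_load_explicit(flag, memory_order_acquire))
            return;
        _mm_pause();
    }
    const struct timespec ts = { .tv_sec = 0, .tv_nsec = 1000 };  // ~1 us
    while (!atomic_load_explicit(flag, memory_order_acquire))
        nanosleep(&ts, NULL);                // let other tasks run
}
```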
Or use `futex` to sleep on a value changing, or on a notify from another process. (It doesn't guarantee that the kernel will use `monitor` / `mwait` to sleep until a change; it will let other tasks run. So you do need to have the unlocking thread make a `futex` system call if any waiters had gone to sleep with `futex`. But you can still make a lightweight mutex that avoids any system calls in the locker and unlocker if there isn't contention and no threads go to sleep.)
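Here's a minimal sketch of such a lightweight futex-based lock for Linux. The three-state encoding (0 = unlocked, 1 = locked, 2 = locked with possible waiters) is a classic scheme assumed by this sketch, not something mandated above; the uncontended lock and unlock paths make no system calls:

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static long futex(atomic_int *uaddr, int op, int val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

// 0 = unlocked, 1 = locked, 2 = locked with (possible) waiters.
static void lock(atomic_int *m)
{
    int c = 0;
    if (atomic_compare_exchange_strong(m, &c, 1))
        return;                             // fast path: no syscall
    if (c != 2)
        c = atomic_exchange(m, 2);          // mark as contended
    while (c != 0) {
        futex(m, FUTEX_WAIT_PRIVATE, 2);    // sleep only while *m == 2
        c = atomic_exchange(m, 2);          // retake as "contended"
    }
}

static void unlock(atomic_int *m)
{
    if (atomic_exchange(m, 0) == 2)         // someone may be sleeping
        futex(m, FUTEX_WAKE_PRIVATE, 1);    // wake one waiter (syscall)
}
```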
But at that point, with `futex` you're probably reproducing what glibc's `pthread_mutex` does.