A C++ spin lock can be easily implemented using std::atomic_flag. Coded roughly (without special features), it looks like the following:

#include <atomic>

std::atomic_flag f = ATOMIC_FLAG_INIT;

while (f.test_and_set(std::memory_order_acquire)) {} // Acquire lock
// ... do some lock-protected work here ...
f.clear(std::memory_order_release);                  // Release lock

One can see the online assembly; it shows that acquiring the lock is implemented atomically through the XCHG instruction.

Also, as one can see on uops.info (screen here), XCHG may take up to 30 CPU cycles on the quite popular Skylake. This is quite slow.

The overall speed of the spin lock can be measured with such a program.
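For illustration, a minimal sketch of what such a measurement might look like (untested; the thread count, iteration count, and protected counter are arbitrary choices, not the original program):

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

constexpr int kThreads = 4;
constexpr int kIters   = 1'000'000;

std::atomic_flag f = ATOMIC_FLAG_INIT;
long long counter = 0;                 // the tiny piece of data the lock protects

int main() {
    auto t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (int t = 0; t < kThreads; ++t)
        threads.emplace_back([] {
            for (int i = 0; i < kIters; ++i) {
                while (f.test_and_set(std::memory_order_acquire)) {} // acquire
                ++counter;                                           // critical section
                f.clear(std::memory_order_release);                  // release
            }
        });
    for (auto& th : threads) th.join();
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                  std::chrono::steady_clock::now() - t0).count();
    std::printf("counter=%lld, ~%.1f ns per lock/unlock pair\n",
                counter, double(ns) / (double(kThreads) * kIters));
}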

Is it possible to implement spin locking without XCHG? The main concern is speed, not merely using a different instruction.

What is the fastest possible spin lock? Is it possible to make it 10 cycles instead of 30? Or even 5 cycles? Maybe some probabilistic spin lock that runs fast on average?

It should be implemented strictly, meaning that in 100% of cases it correctly protects a piece of code and data. If it is probabilistic, then only its running time should be probabilistic; it must still protect the data 100% correctly on every run.

The main purpose of such a spin lock for me is to protect very tiny operations inside multiple threads, operations that run a dozen or two cycles, hence a 30-cycle delay is too much overhead. Of course, one can say that I could use atomics or other lock-free techniques to implement all operations. But these techniques are not applicable in all cases, and would take too much work to implement strictly across a huge code base with many classes and methods. Hence something generic like a regular spin lock is also needed.

phuclv
Arty
  • Sure, there are other ways, eg `lock bts`, but they take many cycles too. Locked operations are inherently expensive. – Nate Eldredge Nov 16 '21 at 03:16
  • There do exist ways to do a spinlock with only loads and stores, thus not needing a locked RMW, e.g. [Peterson's algorithm](https://en.wikipedia.org/wiki/Peterson%27s_algorithm). But it needs sufficiently strict memory ordering that you are going to need an explicit fence somewhere (StoreLoad reordering would break it), and that has a cost too. – Nate Eldredge Nov 16 '21 at 03:18
  • (In fact, on many CPUs the fastest available sequential consistency fence is `xchg` or `lock add`; see https://stackoverflow.com/questions/4232660/which-is-a-better-write-barrier-on-x86-lockaddl-or-xchgl?noredirect=1&lq=1) – Nate Eldredge Nov 16 '21 at 03:26
  • @NateEldredge Maybe there exist some probabilistic algorithm that is faster? Meaning that it has probable time which is faster on average. Still probabilistic algorithm should be strict regarding protection correctness in all cases, it should be only probable time-wise. – Arty Nov 16 '21 at 03:34
  • It's hard to imagine. Fundamentally, every time you lock, you have to do a load (to see if the lock is available), a conditional jump (to avoid the critical section if it's not available), and usually a store (to actually take the lock). If any of those are skipped then the lock cannot have been correctly taken, so they have to execute with 100% probability. And those three alone are already taking pretty much all the budgeted cycles. – Nate Eldredge Nov 16 '21 at 03:39
  • @NateEldredge Can you tell such thing - if one core writes to cache, through any memory-write instruction. Does it mean that on very next instruction (next cycle) after write this written data is already available on all other cores if they read this memory location? Meaning that there is not a single extra cycle that is spent for propagating written data to other cores? It is due to cache coherence algorithm, as I understand. Or cache-coherence needs several extra cycles? – Arty Nov 16 '21 at 03:45
  • You might want to read about MESI and similar coherence algorithms. I don't know the exact real-life costs, but roughly speaking, if this core already owns that cache line, writing to it is as cheap as possible. If some other core owns it, then we have to ask them to give it to us, which takes some time. – Nate Eldredge Nov 16 '21 at 03:47
  • @NateEldredge Do you know if anything other than cache and RAM is shared between cores? For example do there exist any special registers that are shared between cores and that are very fast to read/write? Or interaction between cores can be done only through common cache? – Arty Nov 16 '21 at 03:50
  • No, no free lunch that I know of. When such "side channels" have existed they've usually been security holes that had to be plugged. Besides memory and external devices, the only other form of inter-core communication I know is the inter-processor interrupt (IPI) and that is very expensive. – Nate Eldredge Nov 16 '21 at 03:55
  • Probably your time and energy is better spent trying to improve your algorithms so that less locking is needed in the first place. Optimizing the lock itself does not seem like it's going to be low-hanging fruit. – Nate Eldredge Nov 16 '21 at 04:00
  • @NateEldredge Yes, optimizing other parts of algorithm is an option. But spin lock is needed two. Imagine a huge code base that needs a lot of classes that need multi-threading protection. There are many classes that have tiny operations. Of course if operations (methods) are heavy then `std::mutex` is quite enough. But if they are tiny, dozen of cycles, then of course optimized spin lock will be great for them. If over whole code base there will be just 3-4 such tiny methods then there is no difficulty to optimize these methods. But if there are many then generic Fast spin lock is needed. – Arty Nov 16 '21 at 04:07
  • Large amounts of engineering effort have already been spent on `std::mutex` on all mainstream systems. It might provide some things you don't necessarily need, like fairness, and fallback to OS-assisted sleep / wakeup, but those are pretty minor. If you want something that works like a lock, you're not going to get significantly cheaper than `std::mutex` on most modern systems. If there was a better way, `std::mutex` would already be doing it. e.g. using HLE lock-elision prefixes on CPUs with working TSX support. (Unfortunately Intel disabled the HLE part of TSX in microcode recently.) – Peter Cordes Nov 16 '21 at 04:38
  • Keep in mind that modern x86 CPUs can do out-of-order exec of *independent* ALU instructions around `xchg` or `lock cmpxchg`, so a microbenchmark that only hammers away on `xchg` might not represent the real cost; i.e. some of that ~18 cycle throughput cost (AMD Zen2 / Intel Skylake) can overlap with other work. You can make sure that *unlocking* is only using a release store, not another atomic RMW, though. – Peter Cordes Nov 16 '21 at 04:42
  • @PeterCordes On my Windows 64-bit laptop I measured `std::mutex` and it appeared to be `120 ns`. But same measurement for spin-lock similar to code above gave `30 ns`. Of course standard mutex may give some special features like sleeping thread. But right now as I need spin lock for tiny operations spanning dozen or two of nanoseconds, then of course I don't need any sleep.More than that my spin lock measured on massive amount of operations proved that not a single extra feature is needed,it is totally enough and 4 times faster than std::mutex.My Question is about whether I can speedup further – Arty Nov 16 '21 at 04:46
  • IDK about Windows, maybe their implementation isn't as good as GNU/Linux. https://godbolt.org/z/Gej6YGGhn replaces your hand-rolled spinlock with `std::lock_guard`. On Godbolt's Skylake-AVX512 / Cascade-Lake servers of unknown frequency, it reports about 18ns vs. hand-rolled ~9 ns, so maybe it's using `xchg` to unlock as well, for some reason. I think I've seen that when disassembling a library, maybe in Linux pthreads. (Not a well controlled experiment since those AWS servers aren't idle, but over a few runs it's about that.) – Peter Cordes Nov 16 '21 at 04:55
  • @Atry, did you use `std::mutex` from Visual Studio? It is known to be bad, as it is implemented in some ugly backward-compatible way. `std::shared_mutex` should be better. – Alex Guteniev Nov 16 '21 at 11:15
  • @AlexGuteniev Actually I used Clang on Windows. But it is using MSVC's std library. So it means yes, I used MSVC's mutex. Can you tell if I use `libc++` std library (coming from LLVM) on Windows will it give a better mutex implementation? If not so then what implementation should be taken on Windows? BTW, if I compile same benchmark on Linux then on my machine it takes `30-40 ns` compared to `120 ns` on Windows. – Arty Nov 16 '21 at 11:17
  • @PeterCordes Always wanted to ask such question - do you know what is the difference between `std::unique_lock`/`std::lock_guard`/`std::scoped_lock`? And which one is better and more correct to use if I just want to acquire/release mutex in a block? – Arty Nov 16 '21 at 11:20
  • @Arty: Not off the top of my head; I think `lock_guard` is the simplest if you just want mutual exclusion with all other threads trying to do the same thing. `std::unique_lock` looks like (https://en.cppreference.com/w/cpp/thread/unique_lock) its purpose it to allow transferring ownership of the lock to another scope (e.g. a caller) via C++ move semantics. And scoped_lock is for taking multiple mutex locks with a deadlock-avoidance order (e.g. all threads lock from lowest address to highest, and bail out if they can't get all the locks they wanted.) So as for the actual locking, no diff. – Peter Cordes Nov 16 '21 at 11:28
  • @Arty, sorry I don't know about the state of libc++ on Windows. I suggest benchmarking `std::shared_mutex` from MSVC and take that as a baseline, if libc++ `std::mutex` is about the same or better, then it is good. – Alex Guteniev Nov 16 '21 at 11:32
  • @Arty, I haven't found a question about `mutex` / `shared_mutex` in MSVC, so I asked and answered myself https://stackoverflow.com/q/69990339/2945027 – Alex Guteniev Nov 16 '21 at 13:51
  • @AlexGuteniev Thanks you created great answer by your link! UpVoted both question and answer. – Arty Nov 16 '21 at 13:58

1 Answer

Is it possible to implement spin locking without XCHG?

Yes. For 80x86, you can lock bts or lock cmpxchg or lock xadd or ...
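In C++ these correspond to std::atomic read-modify-write operations; a rough sketch (the function names are illustrative, and the exact instruction chosen is ultimately up to the compiler):

#include <atomic>

std::atomic<unsigned> lock_word{0};

bool try_lock_bts() {
    // fetch_or of a single bit; compilers can emit lock bts for this pattern.
    return (lock_word.fetch_or(1, std::memory_order_acquire) & 1) == 0;
}

bool try_lock_cmpxchg() {
    // compare_exchange compiles to lock cmpxchg on x86.
    unsigned expected = 0;
    return lock_word.compare_exchange_strong(expected, 1,
                                             std::memory_order_acquire);
}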

What is the fastest possible spin lock?

Possible interpretations of "fast" include:

a) Fast in the uncontended case. In this case it's not going to matter very much what you do because most of the possible operations (exchanging, adding, testing...) are cheap and the real costs are cache coherency (getting the cache line containing the lock into the "exclusive" state in the current CPU's cache, possibly including fetching it from RAM or other CPUs' caches) and serialization.

b) Fast in the contended case. In this case you really need a "test without lock; then test & set with lock" approach. The main problem with a simple spinloop (for the contended case) is that when multiple CPUs are spinning, the cache line will be rapidly bouncing from one CPU's cache to the next, consuming a huge amount of interconnect bandwidth for nothing. To prevent this, you'd have a loop that tests the lock state without modifying it, so that the cache line can remain in all CPUs' caches as "shared" at the same time while those CPUs are spinning.

But note that testing read-only to start with can hurt the uncontended case, resulting in more coherency traffic: first a share-request for the cache line which will only get you MESI S state if another core had recently unlocked, and then an RFO (Read For Ownership) when you do try to take the lock. So best practice is probably to start with an RMW, and if that fails then spin read-only with pause until you see the lock available, unless profiling your code on the system you care about shows that a different choice is better.
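A C++ sketch of that best practice, RMW first and then read-only spinning (untested; assumes x86 for the _mm_pause intrinsic):

#include <atomic>
#include <immintrin.h> // _mm_pause, x86 only

std::atomic<bool> locked{false};

void lock() {
    // Optimistic RMW first: the cheapest path when the lock is uncontended.
    while (locked.exchange(true, std::memory_order_acquire)) {
        // Contended: spin read-only so the line can stay Shared in all caches.
        while (locked.load(std::memory_order_relaxed))
            _mm_pause();
    }
}

void unlock() {
    locked.store(false, std::memory_order_release); // plain store, no atomic RMW
}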

c) Fast to exit the spinloop (after contention) when the lock is acquired. In this case the CPU can speculatively execute many iterations of the loop, and when the lock becomes available it has to discard all that mis-speculated work, which costs a little time. To prevent that you want a pause instruction to keep many iterations of the loop from being speculatively executed.

d) Fast for other CPUs that don't touch the lock. For some cases (hyper-threading) the core's resources are shared between logical processors; and when one logical processor is spinning it consumes resources that the other logical processor could've used to get useful work done (especially for the "spinlock speculatively executes many iterations of the loop" situation). To minimize this you need a pause in the spinloop/s (so that the spinning logical processor doesn't consume as much of the core's resources and the other logical processor in the core can get more useful work done).

e) Minimum "worst case time to acquire". With a simple lock, under contention, some CPUs or threads can be lucky and always get the lock while other CPUs/threads are very unlucky and take ages to get the lock; and the "worst case time to acquire" is theoretically infinite (a CPU can spin forever). To fix that you need a fair lock - something to ensure that only the thread that has been waiting/spinning for the longest amount of time is able to acquire the lock when it is released. Note that it's possible to design a fair lock such that each thread spins on a different cache line; which is a different way to solve the "cache line bouncing between CPUs" problem I mentioned in "b) Fast in the contended case".
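For illustration, a minimal ticket-lock sketch (untested; note that it still spins on a shared word, whereas an MCS-style lock would give each waiter its own cache line):

#include <atomic>
#include <immintrin.h> // _mm_pause, x86 only

struct TicketLock {
    std::atomic<unsigned> next{0};    // next ticket to hand out
    std::atomic<unsigned> serving{0}; // ticket currently being served

    void lock() {
        // Take a ticket; threads acquire strictly in ticket (FIFO) order.
        unsigned my = next.fetch_add(1, std::memory_order_relaxed);
        while (serving.load(std::memory_order_acquire) != my)
            _mm_pause();
    }
    void unlock() {
        // Only the lock holder writes serving, so a plain load+store is fine.
        serving.store(serving.load(std::memory_order_relaxed) + 1,
                      std::memory_order_release);
    }
};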

f) Minimum "worst case until lock released". This has to involve the length of the worst critical section; but in some situations may also include the cost of any number of IRQs, the cost of any number of task switches and the time the code isn't using any CPU. It's entirely possible to have a situation where a thread acquires the lock then the scheduler does a thread switch; then many CPUs all spin (wasting a huge amount of time) on a lock that can not be released (because the lock holder is the only one that can release the lock and it isn't even using any CPU). The way to fix/improve this is to disable the scheduler and IRQs; which is fine in kernel code, but "likely impossible for security reasons" in normal user-space code. This is also the reason why spinlocks should probably never be used in user-space (and why user-space should probably use a mutex where the thread is put in a "blocked waiting for lock" state and not given CPU time by the scheduler until/unless the thread actually can acquire the lock).

Note that making it fast for one possible interpretation of "fast" can make it slower/worse for other interpretations of "fast". For example; the speed of the uncontended case is made worse by everything else.

Example Spinlock

This example is untested, and written in (NASM syntax) assembly.

;Input
; ebx = address of lock

;Initial optimism in the hope the lock isn't contended
spinlock_acquire:
    lock bts dword [ebx],0      ;Set the lowest bit and get its previous value in carry flag
                                ;Did we actually acquire it, i.e. was it previously 0 = unlocked?
    jnc .acquired               ; Yes, done!

;Waiting (without modifying) to avoid "cache line bouncing"

.spin:
    pause                       ;Reduce resource consumption
                                ; and avoid memory order mis-speculation when the lock becomes available.
    test dword [ebx],1          ;Has the lock been released?
    jne .spin                   ; No, keep waiting until it is released

;Try to acquire again

    lock bts dword [ebx],0      ;Set the lowest bit and get its previous value in carry flag
                                ;Did we actually acquire it?
    jc .spin                    ; No, go back to waiting

.acquired:

Spin-unlock can be just mov dword [ebx],0, not lock btr, because you know you own the lock, and a plain store has release semantics on x86. You could also read the lock first to catch double-unlock bugs.

Notes:

a) lock bts is a little slower than other possibilities; but it doesn't interfere with or depend on the other 31 bits (or 63 bits) of the lock, which means that those other bits can be used for detecting programming mistakes (e.g. store 31 bits of "thread ID that currently holds lock" in them when the lock is acquired and check them when the lock is released to auto-detect "Wrong thread releasing lock" and "Lock being released when it was never acquired" bugs) and/or used for gathering performance information (e.g. set bit 1 when there's contention so that other code can scan periodically to determine which locks are rarely contended and which locks are heavily contended). Bugs with the use of locks are often extremely insidious and hard to find (unpredictable and unreproducible "Heisenbugs" that disappear as soon as you try to find them); so I have a preference for "slower with automatic bug detection".
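A C++ sketch of that idea (untested; my_thread_id is a made-up helper returning a small nonzero per-thread ID, just for illustration):

#include <atomic>
#include <cassert>
#include <immintrin.h> // _mm_pause, x86 only

unsigned my_thread_id() {
    // Hypothetical small per-thread ID; real code would assign these at thread start.
    static std::atomic<unsigned> next_id{1};
    thread_local unsigned id = next_id.fetch_add(1, std::memory_order_relaxed);
    return id;
}

std::atomic<unsigned> lock_word{0}; // bit 0 = locked, bits 1..31 = owner's thread ID

void lock() {
    unsigned desired  = 1u | (my_thread_id() << 1);
    unsigned expected = 0;
    while (!lock_word.compare_exchange_weak(expected, desired,
                                            std::memory_order_acquire,
                                            std::memory_order_relaxed)) {
        expected = 0;
        _mm_pause();
    }
}

void unlock() {
    // Auto-detect "wrong thread releasing" and "releasing a never-acquired lock".
    assert(lock_word.load(std::memory_order_relaxed) ==
           (1u | (my_thread_id() << 1)));
    lock_word.store(0, std::memory_order_release);
}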

b) This is not a fair lock, which means it's not well suited to situations where contention is likely.

c) For memory; there's a compromise between memory consumption/cache misses, and false sharing. For rarely contended locks I like to put the lock in the same cache line/s as the data the lock protects, so that acquiring the lock means that the data the lock holder wants is already in the cache (and no subsequent cache miss occurs). For heavily contended locks this causes false sharing and should be avoided by reserving the whole cache line for the lock and nothing else (e.g. by adding 60 bytes of unused padding after the 4 bytes used by the actual lock, like in C++ alignas(64) struct { std::atomic<int> lock; }; ). Of course a spinlock like this shouldn't be used for heavily contended locks so it's reasonable to assume that minimizing memory consumption (and not having any padding, and not caring about false sharing) makes sense.
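The two layouts in C++ (a sketch, assuming 64-byte cache lines; the struct names are illustrative):

#include <atomic>

// Rarely contended: keep the lock in the same cache line as the data it
// protects, so acquiring the lock also pulls the data into the cache.
struct Counter {
    std::atomic<int> lock{0};
    int value = 0; // protected by lock, shares its cache line
};

// Heavily contended: give the lock a whole 64-byte line to itself to
// avoid false sharing with the data.
struct alignas(64) PaddedLock {
    std::atomic<int> lock{0};
    char padding[60]; // 64 - sizeof(std::atomic<int>)
};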

The main purpose of such a spin lock for me is to protect very tiny operations inside multiple threads, operations that run a dozen or two cycles, hence a 30-cycle delay is too much overhead

In that case I'd suggest trying to replace locks with atomics, block-free algorithms, and lock-free algorithms. A trivial example is tracking statistics, where you might want to do lock inc dword [number_of_chickens] instead of acquiring a lock to increment number_of_chickens.
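The chicken-counting example in C++ (fetch_add with a discarded result typically compiles to exactly that single locked instruction):

#include <atomic>

std::atomic<int> number_of_chickens{0};

void on_chicken() {
    // One locked RMW; no separate acquire/release pair needed.
    number_of_chickens.fetch_add(1, std::memory_order_relaxed);
}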

Beyond that it's hard to say much - for one extreme, the program could be spending most of its time doing work without needing locks and the cost of locking may have almost no impact on overall performance (even though acquire/release is more expensive than the tiny critical sections); and for the other extreme the program could be spending most of its time acquiring and releasing locks. In other words, the cost of acquiring/releasing locks is somewhere between "irrelevant" and "major design flaw (using far too many locks and needing to redesign the entire program)".

Peter Cordes
Brendan
  • Thanks for expanded answer! UpVoted. You gave a good idea that while acquiring lock it is better to loop just with reading flag, not exchanging, this may save considerable time because doesn't need cache line to be switched between cores. Do you have by any chance ready-made implementation of Spin Lock with all ideas that you described? Maybe some GitHub link? – Arty Nov 16 '21 at 05:10
  • @Arty: Added an example spinlock in there with some implementation notes (intended as an example - probably not code you'd want to "cut & paste"). – Brendan Nov 16 '21 at 09:08
  • @Arty: One downside to that pessimistic strategy is that it can lead to more coherence traffic, if the lock is available but the cache line is not already exclusively owned. (i.e. another thread unlocked it recently so it's still hot in their private cache). A read will generate a share request that only gets MESI S state, and then the `xchg` or `lock cmpxchg` (or `lock bts` if you want) will also miss, and only then generate an RFO to get Exclusive or Modified state. See [this Q&A](https://stackoverflow.com/q/63008857/224132), and maybe [this](https://stackoverflow.com/a/37246263/224132). – Peter Cordes Nov 16 '21 at 09:14
  • I was under the impression that the usual best practice was to attempt an RMW first, and only *spin-wait* read-only to avoid having all cores hammering on the line (even with some backoff in the `pause`/`cmp` loops.) But until a year or two ago, I hadn't realized the disadvantage of making the very first check read-only, which delays the fast path in the ideal case. – Peter Cordes Nov 16 '21 at 09:17
  • So the final conclusion is that except `XCHG` or `lock ...` instructions there is nothing that can be used to implement spin lock? Actually I was looking for some algorithm that is fast and that doesn't need locking memory. Are you aware if any such special algorithm exist that works without locking? – Arty Nov 16 '21 at 13:36
  • Brendan, accepting your answer, but maybe you can give me last answer - what do you think about my [previous comment](https://stackoverflow.com/questions/69983323/implementing-spin-lock-without-xchg#comment123722761_69983889) ? Is there any algorithmical way if CPU doesn't support XCHG/lock or anything similar? Maybe @PeterCordes also can answer my previous comment? – Arty May 10 '22 at 13:32
  • @Arty: double-checked locking can give mutual exclusion without ever needing an atomic RMW, but it still needs a full memory barrier so it's actually *more* expensive than just having one thread claim the lock with an RMW and the others immediately know they lost the race. (Your only options for memory barriers on x86 are `lock`ed operations, or an `mfence` which is even *more* expensive on many recent CPUs.) Getting exclusive ownership of the cache line to be able to commit a store in the first place is the expensive part if there's contention, moreso than `xchg` of a byte. – Peter Cordes May 10 '22 at 13:36
  • @PeterCordes So final conslusion is that just Algorithmically (by mathematics only) it is NOT possible to implement spin-lock? Only with special hardware support, like XCHG/lock/mfence instructions? So if some rare CPU model has only mathematical operations inside but no special instructions like XCHG/lock/mfence, then there is no chance for this CPU to implement spin-lock or std::atomic_flag? – Arty May 10 '22 at 13:54
  • @Arty: If you had a hypothetical machine without any way to get sequential consistency, and without any atomic RMWs, yes, you'd be screwed. You also couldn't implement Java or C++ on such a machine, because they do require that it be possible to recover sequential consistency. If you had a machine that is *always* slow and sequentially consistent (like 80386 SMP systems), you wouldn't need any special instructions to implement [Dekker's algorithm](https://en.wikipedia.org/wiki/Dekker%27s_algorithm). IDK what you mean about math, though; Math can describe parallel algorithms and synchronization – Peter Cordes May 10 '22 at 14:02