40

I'm using a spin lock to protect a very small critical section. Contention happens very rarely so a spin lock is more appropriate than a regular mutex.

My current code is as follows, and assumes x86 and GCC:

volatile int exclusion = 0;

void lock() {
    while (__sync_lock_test_and_set(&exclusion, 1)) {
        // Do nothing. This GCC builtin
        // ensures a memory barrier.
    }
}

void unlock() {
    __sync_synchronize(); // Memory barrier.
    exclusion = 0;
}

So I'm wondering:

  • Is this code correct? Does it correctly ensure mutual exclusion?
  • Does it work on all x86 operating systems?
  • Does it work on x86_64 too? On all operating systems?
  • Is it optimal?
    • I've seen spin lock implementations using compare-and-swap but I'm not sure which is better.
    • According to the GCC atomic builtins documentation (http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html) there's also __sync_lock_release. I'm not an expert on memory barriers so I'm not sure whether it's okay for me to use this instead of __sync_synchronize.
    • I'm optimizing for the case in which there's no contention.

I do not care at all about contention. There may be 1, maybe 2 other threads trying to lock the spin lock once every few days.

Hongli
  • I'm curious why you aren't using pthread mutexes. In the no-contention case, a lock or unlock is only a couple of instructions. – Jay Conrod Sep 05 '09 at 14:57
  • I'm with Jay - if your purpose is to speed up your app, rather than to learn how to implement a spinlock, then before worrying about whether this is correct, test whether it's actually any faster than a mutex. If not, who cares whether it's correct? – Steve Jessop Sep 05 '09 at 15:28
  • I've already tested. The current spinlock *is* faster than a mutex, at least on Linux. I'm not avoiding posix mutexes for no good reason. – Hongli Sep 05 '09 at 16:07
  • Came across your post and I'm wondering if you ever tested this against POSIX spinlocks? – Michael Mior Sep 30 '10 at 19:02
  • Yes, in my code I use OS X spinlocks when possible, then fallback to GCC atomic builtins, then fallback to pthread mutexes. OS X spinlocks perform about the same as the GCC atomic builtins and much faster than pthread mutexes in the uncontended case. – Hongli Sep 30 '10 at 20:34
  • Never ever spin without a condition. That's a literal live-lock. – Luiz Felipe Jul 12 '22 at 17:34

11 Answers

22

Looks fine to me. Btw, here is the textbook implementation that is more efficient even in the contended case.

void lock(volatile int *exclusion)
{
    // Atomic exchange (with acquire semantics); loop until we change 0 to 1.
    while (__sync_lock_test_and_set(exclusion, 1))
        // Spin on a plain read until the lock looks free, instead of
        // hammering the bus with locked instructions while we wait.
        while (*exclusion)
            ;
}
sigjuice
  • Anyone reading this in the future: it's not efficient on anything with enough cores. Also, this implementation is wrong: it doesn't yield. You must yield to the operating system eventually, otherwise you get starvation and a live-locked system. https://randomascii.wordpress.com/2012/06/05/in-praise-of-idleness/ – Luiz Felipe Jul 12 '22 at 17:47
18

So I'm wondering:

* Is it correct?

In the context mentioned, I would say yes.

* Is it optimal?

That's a loaded question. By reinventing the wheel you are also reinventing a lot of problems that have been solved by other implementations:

  • I'd expect a waste loop on failure where you aren't trying to access the lock word.

  • The unlock only needs release semantics, not a full barrier (that's why you'd use __sync_lock_release, so that you'd get st1.rel on Itanium instead of mf, or a lwsync on PowerPC, ...). If you really only care about x86 or x86_64, the exact type of barrier used here doesn't matter as much (but if you were to make the jump to Intel's Itanium for an HP-IPF port then you wouldn't want this).

  • You don't have the pause instruction that you'd normally put in your waste loop.

  • When there is contention you want to wait on something: semop, or even a dumb sleep in desperation. If you really need the performance that spinning buys you then the futex suggestion is probably a good one. If you need that performance badly enough to maintain this code, you have a lot of research to do. (A sketch combining the first three points follows this list.)
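
Here is a rough sketch of those first three points put together (my own illustration, assuming GCC on x86/x86_64 and the question's exclusion variable; a starting point, not a drop-in replacement):

volatile int exclusion = 0;

void lock() {
    while (__sync_lock_test_and_set(&exclusion, 1)) {      // atomic exchange, acquire
        while (exclusion)                                   // waste loop: plain reads only
            __asm__ __volatile__("pause");                  // spin hint for the CPU
    }
}

void unlock() {
    __sync_lock_release(&exclusion);                        // writes 0 with release semantics
}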

Note that there was a comment saying that the release barrier wasn't required. That isn't true even on x86 because the release barrier also serves as an instruction to the compiler to not shuffle other memory accesses around the "barrier". Very much like what you'd get if you used asm ("" ::: "memory" ).

* on compare and swap

On x86, __sync_lock_test_and_set will map to an xchg instruction, which has an implied lock prefix. Definitely the most compact generated code (especially if you use a byte for the "lock word" instead of an int), but no less correct than if you used LOCK CMPXCHG. Compare-and-swap can be used for fancier algorithms (like putting a non-zero pointer to metadata for the first "waiter" into the lock word on failure).

Peeter Joot
4

In response to your questions:

  1. Looks ok to me
  2. Assuming the OS supports GCC (and GCC has the builtins implemented), this should work on all x86 operating systems. The GCC documentation suggests that a warning will be produced if they are not supported on a given platform.
  3. There's nothing x86-64-specific here, so I don't see why not. This can be expanded to cover any architecture that GCC supports; however, there may be more optimal ways of achieving this on non-x86 architectures.
  4. You might be slightly better off using __sync_lock_release() in the unlock() case, as this will write 0 to the lock word and provide the release barrier in a single operation. However, given your assertion that there will rarely be contention, it looks good to me.
DaveR
4

If you're on a recent version of Linux, you may be able to use a futex -- a "fast userspace mutex":

A properly programmed futex-based lock will not use system calls except when the lock is contended

In the uncontested case, which you're trying to optimize for with your spinlock, the futex will behave just like a spinlock, without requiring a kernel syscall. If the lock is contested, the waiting takes place in the kernel without busy-waiting.
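
To make the mechanism concrete, here is a deliberately naive sketch of a futex-backed lock (my illustration, Linux-specific; a production-quality version, such as the three-state scheme in Drepper's "Futexes Are Tricky", also avoids the FUTEX_WAKE syscall when the unlock is uncontended):

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

volatile int futex_word = 0;   /* 0 = unlocked, 1 = locked */

void futex_lock() {
    while (__sync_lock_test_and_set(&futex_word, 1)) {
        /* Only reached under contention: sleep in the kernel until the
           word is no longer 1, then retry the atomic exchange. */
        syscall(SYS_futex, &futex_word, FUTEX_WAIT, 1, NULL, NULL, 0);
    }
}

void futex_unlock() {
    __sync_lock_release(&futex_word);   /* word = 0, with release semantics */
    syscall(SYS_futex, &futex_word, FUTEX_WAKE, 1, NULL, NULL, 0);
}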

Commodore Jaeger
  • To whoever is reading this: on Windows you should use the equivalent in behavior to a futex: Critical Sections! https://learn.microsoft.com/en-us/windows/win32/sync/critical-section-objects – Luiz Felipe Jul 12 '22 at 17:38
4

I wonder if the following CAS implementation is the correct one on x86_64. It is almost twice as fast on my i7 X920 laptop (Fedora 13 x86_64, gcc 4.4.5).

inline void lock(volatile int *locked) {
    while (__sync_val_compare_and_swap(locked, 0, 1));
    asm volatile("lfence" ::: "memory");
}
inline void unlock(volatile int *locked) {
    *locked=0;
    asm volatile("sfence" ::: "memory");
}
rpetrich
  • No need for a fence in lock. The fence in unlock must appear _before_ the assign to locked. – rasmus Apr 08 '12 at 00:06
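
Following that comment, a corrected sketch might look like this (my reading of the suggestion, not the answerer's code: the CAS already implies a full barrier on x86, so no fence is needed in lock(), and the barrier in unlock() only has to stop the compiler from moving accesses past the releasing store):

inline void lock(volatile int *locked) {
    /* __sync_val_compare_and_swap already implies a full barrier. */
    while (__sync_val_compare_and_swap(locked, 0, 1))
        ;
}

inline void unlock(volatile int *locked) {
    /* Compiler barrier before the releasing store; on x86 the store
       itself already has release ordering at the hardware level. */
    asm volatile("" ::: "memory");
    *locked = 0;
}
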
2

I can't comment on correctness, but the title of your question raised a red flag before I even read the question body. Synchronization primitives are devilishly hard to get right... if at all possible, you're better off using a well-designed/maintained library, perhaps pthreads or boost::thread.

Jason S
  • I have a very good reason to not use pthreads or boost::threads in this specific case. – Hongli Sep 05 '09 at 15:56
2

There are a few wrong assumptions.

First, a spinlock makes sense only if the resource is locked on another CPU. If the resource is locked on the same CPU (which is always the case on uniprocessor systems), you need to relax the scheduler in order to unlock the resource. Your current code will work on a uniprocessor system because the scheduler will switch tasks automatically, but it is a waste of resources.

On a multi-processor system the same thing can happen, but a task may migrate from one CPU to another. In short, using a spinlock is correct only if you can guarantee that your tasks will run on different CPUs.

Secondly, locking a mutex IS fast (as fast as a spinlock) when it is unlocked. Mutex locking (and unlocking) is slow (very slow) only if the mutex is already locked.

So, in your case, I suggest using mutexes.
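
For comparison, a minimal pthread version of the same critical section looks like this (the mutex variable name is mine):

#include <pthread.h>

static pthread_mutex_t exclusion_mutex = PTHREAD_MUTEX_INITIALIZER;

void lock()   { pthread_mutex_lock(&exclusion_mutex); }     /* cheap when uncontended */
void unlock() { pthread_mutex_unlock(&exclusion_mutex); }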

Jérôme Pouiller
1

One improvement to suggest is using TATAS (test-and-test-and-set). CAS operations are considered quite expensive for the processor, so it's better to avoid them when possible. Another thing: make sure you won't suffer from priority inversion (what if a thread with a high priority tries to acquire the lock while a thread with low priority tries to free the lock?). On Windows, for example, this issue will ultimately be solved by the scheduler using a priority boost, but you can explicitly give up your thread's time slice in case you didn't succeed in acquiring the lock in your last 20 tries, for example. A rough sketch follows.
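
A rough sketch of that idea (my own illustration; MAX_SPINS is a made-up tuning knob):

#include <sched.h>

#define MAX_SPINS 20

void lock(volatile int *exclusion) {
    for (;;) {
        if (!__sync_lock_test_and_set(exclusion, 1))
            return;                                /* acquired */
        for (int i = 0; i < MAX_SPINS && *exclusion; ++i)
            __asm__ __volatile__("pause");         /* cheap read-only spin */
        if (*exclusion)
            sched_yield();                         /* still held: give up the time slice */
    }
}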

  • I'm not sure this is an improvement, given the OP's stated assumption that contention is *extremely* rare. In TATAS, the first test is to cheaply check if the lock is held, and to spin in the cheap non-interlocked code until the lock looks free. Only then does it advance to the expensive interlocked test-and-set. In the OP's case, the lock almost always *is* free, so this just adds another test that 99.99999% of the time immediately falls through to the interlocked test. – PaulMcG Sep 06 '09 at 05:31
1

To make your spinlock implementation more efficient, you can yield to the scheduler whenever the lock acquisition fails.

void lock() {
    while (__sync_lock_test_and_set(&exclusion, 1))
        sched_yield();  // give the rest of the time slice back to the scheduler (POSIX, <sched.h>)
}
0

Your unlock procedure doesn't need the memory barrier; the assignment to exclusion is atomic as long as it is dword-aligned on x86.

Ira Baxter
  • The memory barrier isn't there to ensure an atomic write to the lock. – Logan Capaldo Sep 05 '09 at 15:24
  • That's right. It doesn't have anything to do with the atomicity of the write. That's my point; it doesn't add anything at all. – Ira Baxter Sep 05 '09 at 16:10
  • Yes it does. See http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html – Ken Sep 05 '09 at 19:10
  • @Ken: The memory write to location "exclusion" on free will be ordered after the write that locks it if they both execute on the same CPU. If CPU 1 does the lock, and CPU 2 does the unlock, then a memory ordering problem might occur; but the only way for this to occur is for CPU 2 to decide it has the lock, which it cannot reasonably do without attempting to acquire the lock (encounters membar), or reading the "exclusion" location, which it can't reasonably see as 1 until after CPU 1 actually issues the write. The example discussed in Pugh is a double-checked lock. How is that relevant? – Ira Baxter Sep 06 '09 at 00:02
  • The barrier there is to avoid reorder, which is essential. – W.Sun May 01 '13 at 19:22
  • @IraBaxter: +1, full memory fence there is useless and harms performance! –  Jan 25 '14 at 03:01
  • @W.Sun: On x86 in this case, to avoid reordering, one should simply use compiler barrier and not memory fences. –  Jan 25 '14 at 03:02
0

In the specific case of x86 (32/64) I don't think you need a memory fence at all in the unlock code. x86 doesn't do any reordering, except that stores are first put in a store buffer, so their becoming visible to other threads can be delayed. And a thread that does a store and then reads from the same variable will read it from its store buffer if it has not yet been flushed to memory. So all you need is an asm statement to prevent compiler reorderings. You run the risk of one thread holding the lock slightly longer than necessary from the perspective of other threads, but if you don't care about contention that shouldn't matter. In fact, pthread_spin_unlock is implemented like that on my system (Linux x86_64).

My system also implements pthread_spin_lock using lock decl lockvar; jne spinloop; instead of using xchg (which is what __sync_lock_test_and_set uses), but I don't know if there's actually a performance difference.

JanKanis