
I want to implement a spinlock in assembly. Here I post a solution I came up with. Is it correct? Do you know a shorter one?

lock:
    mov ecx, 0              ; 0 is the "held" value we swap in
.loop:
    xchg [eax], ecx         ; atomic swap; ecx = previous lock value
    cmp ecx, 0
    je .loop                ; previous value 0 => someone else holds it

release:
    lock dec dword [eax]    ; 0 -> -1, marking the lock free again

The value at [eax] is initialized to -1 (which means the lock is free). This should work for many threads (not necessarily just 2).
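
A minimal usage sketch (mine, not part of the original question): it assumes the lock word lives at a label such as my_lock, that the two routines are renamed spin_lock/spin_release (lock is a reserved word in NASM), and that each ends with a ret so it can be called:

; Hypothetical usage, 32-bit NASM; the names are illustrative only
section .data
my_lock dd -1               ; -1 = free, 0 = held

section .text
critical_section:
    mov eax, my_lock        ; eax = address of the lock word
    call spin_lock          ; spin until we own the lock
    ; ... code that must not run concurrently ...
    call spin_release       ; hand the lock back (word returns to -1)
    ret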

melkyades

3 Answers

5

Shortest would probably be:

acquire:
    lock bts dword [eax],0  ;Atomically set bit 0; CF = previous value
    jc acquire              ;Bit was already set => lock held, retry

release:
    mov dword [eax],0

For performance, it's best to use a "test and test-and-set" approach, and use pause, like this:

acquire:
    lock bts dword [eax],0  ;Optimistic first attempt
    jnc l2                  ;Success if acquired
l1:
    pause
    test dword [eax],1      ;Is the lock still held?
    jne l1                  ;Don't attempt again unless there's a chance

    lock bts dword [eax],0  ;Attempt to acquire
    jc l1                   ;Wait again if failed

l2:

release:
    mov dword [eax],0

For debugging, you can add extra data to make it easier to detect problems, like this:

acquire:
    lock bts dword [eax],31   ;Optimistic first attempt
    jnc l2                    ;Success if acquired

    mov ebx,[CPUnumber]
    lea ebx,[ebx+0x80000000]  ;ebx = CPU number with the "locked" bit 31 set
    cmp [eax],ebx             ;Is the lock acquired by this CPU?
    je .bad                   ; yes, deadlock
    lock inc dword [eax+4]    ;Increase "lock contention counter"
l1:
    pause
    test dword [eax],0x80000000
    jne l1                    ;Don't attempt again unless there's a chance

    lock bts dword [eax],31   ;Attempt to acquire
    jc l1                     ;Wait again if failed

l2: mov [eax],ebx             ;Store CPU number

release:
    mov ebx,[CPUnumber]
    lea ebx,[ebx+0x80000000]
    cmp [eax],ebx             ;Is lock acquired, and is CPU same?
    jne .bad                  ; no, either not acquired or wrong CPU
    mov dword [eax],0

Brendan
  • If the first attempt succeeds, wouldn't it store junk in `[eax]`? (in the debugging version) – harold Apr 08 '14 at 18:41
  • @harold: It uses one bit for the lock, and the remaining 31 bits to keep track of who acquired the lock (so that it can detect when you're releasing a lock that you didn't acquire but someone else did; and deadlocks). – Brendan Apr 08 '14 at 18:55
  • Yes but I mean `ebx` hasn't been set to a sensible value yet if you take that path, right? – harold Apr 08 '14 at 18:59
  • Why `lea ebx,[ebx+0x80000000]`? Is that left-over from using a different register? And BTW, shouldn't `CPUnumber` be `ThreadNumber`? It's not physical CPU that matters for correctness. Anyway, nice. This is basically the same as my pure asm answer using `xchg` on [Locks around memory manipulation via inline assembly](https://stackoverflow.com/a/37246263). – Peter Cordes Aug 09 '18 at 10:12
  • @PeterCordes: It's just "ebx = CPUnumber | 0x80000000" (where highest bit is used for a locked/not locked flag, and lowest 31 bits are the CPU number). For most spinlocks it should be `CPUnumber` (e.g. scheduler lock/s in kernel), and for most cases where `ThreadNumber` would be used it should be a mutex and not a spinlock (imagine single-CPU where a task switch happens immediately after you acquire a lock, and any number of other threads waste all CPU time spinning until scheduler finally gives your thread CPU time again). – Brendan Aug 23 '18 at 13:06
  • My point was why `lea` instead of `add` or `or`, or even `bts ebx, 31` (might save code-size vs. a disp32 or imm32)? You don't normally use `lea` unless you want the result in a different register than the source, because it runs on fewer ports. In this case it doesn't cost any extra code-size, though; opcode + modrm + disp32 vs. imm32 – Peter Cordes Aug 23 '18 at 13:31
  • @PeterCordes: Ah, I see now. I have no idea (too much time has passed since I wrote it); but I have a feeling there was an earlier draft where I didn't want to modify flags (e.g. a test/comparison before and a conditional branch after), and (while editing/fixing the example before posting) the code around it was changed but the original `LEA` stayed. – Brendan Aug 25 '18 at 13:44
1

Your code is fine, but if you're looking for high performance I'd suggest this instead:

  xor ecx, ecx          ; ecx = 0, the "held" value to swap in
.loop:
  lock xchg [eax], ecx  ; ecx = previous lock value (xchg is implicitly locked)
  test ecx, ecx
  jz .loop              ; previous value 0 => someone else holds it, retry

Reasons:

  • xor ecx, ecx is smaller, doesn't require an immediate, and modern CPUs recognize it as a zeroing idiom and handle it very cheaply.
  • test ecx, ecx can be marginally smaller and faster than cmp ecx, 0, because you don't need an immediate and test is just a bitwise AND rather than a subtraction (the encodings below show the size difference).
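
For reference, the typical 32-bit encodings (my own listing, added to back up the size claims):

B9 00 00 00 00      mov ecx, 0      ; 5 bytes (imm32)
31 C9               xor ecx, ecx    ; 2 bytes
83 F9 00            cmp ecx, 0      ; 3 bytes (sign-extended imm8)
85 C9               test ecx, ecx   ; 2 bytes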

P.S. I always put the lock prefix in there regardless of whether it is implied, for readability reasons - it makes it obvious that I'm doing a locked operation.
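
As a rough check of that trade-off (this listing is mine, not Polynomial's), the redundant prefix is a single 0xF0 byte if the assembler emits it literally:

87 08               xchg [eax], ecx        ; 2 bytes, implicitly locked
F0 87 08            lock xchg [eax], ecx   ; 3 bytes if the redundant F0 is kept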

Polynomial
  • Note that the prefix isn't free - it occupies a byte. That's hardly the end of the world, but if you are paying attention to the size of the other instructions... – gsg Apr 08 '14 at 18:18
  • @gsg I just checked on my compiler, and it doesn't add the additional byte for the lock prefix, as it detects that it's superfluous on xchg. – Polynomial Apr 08 '14 at 18:39
  • Huh, ok. I checked `gcc` and `as`, which both do. Know your tools, I guess. – gsg Apr 08 '14 at 18:44
  • I'm a Windows-monkey, no gcc for me! ;] – Polynomial Apr 08 '14 at 18:45
  • You're optimizing for the wrong things. If you care about performance, don't spin on `xchg`; spin read-only until you see the lock available, to reduce contention with the core trying to unlock. And `pause` is essential. See Brendan's answer here, or mine on [Locks around memory manipulation via inline assembly](https://stackoverflow.com/a/37246263). – Peter Cordes Aug 09 '18 at 09:49
-1

Your code is fine and you can always try to make it shorter if you have space problems.

Other answers mention performance and that displays a basic ignorance of how locks work.

When a locked operation is initiated, the core in question raises a signal on one of its pins (LOCK), which tells all other cores, their caches, all memory and all bus-mastering devices (these can update RAM independently of the cores) to complete any outstanding memory operations. When they have done so, they collectively raise another signal, lock acknowledge (LOCKA), which is returned to the original core, and the memory exchange takes place. After this the LOCK signal is switched off.

Once you have arrived here you are able to look at the value you fetched using xchg. If it turns out that another task/thread owns the lock you will need to perform the lock sequence all over again.

Assume the slowest bus-mastering device anywhere on your computer is a 33MHz PCI card. If it is doing something it will need any number of PCI bus clock cycles to complete. Every cycle there means one hundred wait cycles on a 3.3GHz CPU. Put this in the perspective of saving a cycle or two in the lock sequence. There are several buses in a CPU, the chipset and the motherboard and some, all or none of them may be active at any one time - such as when the LOCK is initiated. The active bus that takes longest to reply with LOCKA will determine how fast the lock completes.

Try it yourself: measure how long it takes to do ten million spinlocks (grab and release).
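
As a rough illustration only, an uncontended, single-threaded version of that measurement might look like the sketch below (32-bit Linux NASM; the label names, the bts-style lock and the use of rdtsc are my own choices, not anything from this thread):

; Hypothetical benchmark: time 10,000,000 acquire/release pairs.
; Single-threaded, so the spin branch is never taken and only the
; low 32 bits of the TSC are kept.
section .data
lock_var dd 0                 ; 0 = free, 1 = held (bit 0)

section .text
global _start
_start:
    mov esi, 10000000         ; iteration count
    rdtsc
    mov edi, eax              ; save the low 32 bits of the start TSC
    mov eax, lock_var         ; eax = address of the lock word
.bench:
    lock bts dword [eax], 0   ; grab: atomically set bit 0, CF = old bit
    jc .bench                 ; would spin here under contention
    mov dword [eax], 0        ; release
    dec esi
    jnz .bench
    rdtsc                     ; clobbers eax:edx
    sub eax, edi              ; elapsed cycles, low 32 bits only
    ; ... report eax, then exit ...
    xor ebx, ebx              ; exit status 0
    mov eax, 1                ; sys_exit (32-bit Linux)
    int 0x80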

I wrote some more on spinlocks here, here and here.

The performance trick with bus-locks (spinlocks, critical sections on Windows) is to use them as seldom as possible, which means organizing your data to make that possible. A bus-lock will likely complete faster on a non-server computer, because the bus-mastering devices on a server are operating more or less constantly. So if your application is server-based, economizing on bus-locks can be critical to maintaining performance.

EDIT

To Peter Cordes,

You state that

... it's not related to bus-mastering, at least not on CPUs since at least Nehalem

From the latest Intel System Programming Guide:

8.1.4 Effects of a LOCK Operation on Internal Processor Caches

For the Intel486 and Pentium processors, the LOCK# signal is always asserted on the bus during a LOCK operation, even if the area of memory being locked is cached in the processor.

For the P6 and more recent processor families, if the area of memory being locked during a LOCK operation is cached in the processor that is performing the LOCK operation as write-back memory and is completely contained in a cache line, the processor may not assert the LOCK# signal on the bus. Instead, it will modify the memory location internally and allow its cache coherency mechanism to ensure that the operation is carried out atomically. This operation is called “cache locking.” The cache coherency mechanism automatically prevents two or more processors that have cached the same area of memory from simultaneously modifying data in that area.

In the second paragraph it says

... the processor may not assert the LOCK# signal on the bus.

Now, I don't know about you but to me at least "may not" sounds suspiciously unlike "will not" or "won't".

The numbers I have produced may or may not be correct - even I make mistakes now and then - but I challenge you to stop quoting this or that "authority" and instead get your hands dirty by doing the work to either disprove my numbers, find the error(s) or discrepancies. I included the relevant source code in another thread (where I also griped about your lofty, theoretical, armchair-based comments) so it won't take you forever to get started.

Start, for example, by proving that I am

over-stating the cost of locking in the no-contention case ...

I look forward to your analysis.

Olof Forshell
  • `xchg [mem], reg` has back-to-back latency of ~25 cycles on Sandybridge-family CPUs (on normal memory, i.e. Write-Back cacheable), with the cache line staying in Modified state in L1d cache of the core. https://agner.org/optimize/. It only requires a cache-lock, not a bus-lock, for an aligned dword, to make this work thanks to MESI. (And thanks to modern x86 having cache-coherent DMA.) You're over-stating the cost of locking in the no-contention case where the same thread is repeatedly locking/unlocking the same lock. – Peter Cordes Aug 09 '18 at 09:43
  • Good design to avoid locking is definitely helpful, but it's not related to bus-mastering, at least not on CPUs since at least Nehalem and probably somewhat earlier. – Peter Cordes Aug 09 '18 at 09:44
  • The "may not" caveat is there because *misaligned* `lock` prefixes still have to do that, and are prohibitively expensive. (There are recent CPU features to make that fault, to aid in detecting accidental use in cases where perf counters are hard to use.) I actually tested, see my answer on [Is it possible for two lock() statements to be executed at the same time on two CPU cores ? Do CPU cores tick at the same time?](https://stackoverflow.com/q/66996864) - and found perfect scaling on quad-core Skylake for different threads running `xchg` on different locations. – Peter Cordes Feb 10 '22 at 23:35