cmpxchg for WORD faster than for BYTE

Question

Yesterday I posted this question on how to write a fast spinlock. Thanks to Cory Nelson I seem to have found a method which outperforms the other methods discussed in my question. I use the CMPXCHG instruction to check if the lock is 0 and thereby free. CMPXCHG operates on ´BYTE´, WORD and DWORD. I would assume that the instruction would operate faster on BYTE. But I wrote a lock implementing each of the datatypes:

inline void spin_lock_8(char* lck)
{
    __asm
    {
        mov ebx, lck                        ;move lck pointer into ebx
        xor cl, cl                          ;set CL to 0
        inc cl                              ;increment CL to 1
        pause                               ;
        spin_loop:
        xor al, al                          ;set AL to 0
        lock cmpxchg byte ptr [ebx], cl     ;compare AL to CL. If equal ZF is set and CL is loaded into address pointed to by ebx
        jnz spin_loop                       ;jump to spin_loop if ZF
    }
}
inline void spin_lock_16(short* lck)
{
    __asm
    {
        mov ebx, lck
        xor cx, cx
        inc cx
        pause
        spin_loop:
        xor ax, ax
        lock cmpxchg word ptr [ebx], cx
        jnz spin_loop
    }
}
inline void spin_lock_32(int* lck)
{
    __asm
    {
        mov ebx, lck
        xor ecx, ecx
        inc ecx
        pause
        spin_loop:
        xor eax, eax
        lock cmpxchg dword ptr [ebx], ecx
        jnz spin_loop
    }
}
inline spin_unlock(<anyType>* lck)
{
    __asm
    {
        mov ebx, lck
        mov <byte/word/dword> ptr [ebx], 0
    }
}

The lock was then tested using the following pseudo-code (please note that the lcm-pointer always will point to an address dividable by 4):

<int/short/char>* lck;
threadFunc()
{
    loop 10,000,000 times
    {
        spin_lock_8/16/32 (lck);
        spin_unlock(lck);
    }
}
main()
{
    lck = (char/short/int*)_aligned_malloc(4, 4);//Ensures memory alignment
    start 1 thread running threadFunc and measure time;
    start 2 threads running threadFunc and measure time;
    start 4 threads running threadFunc and measure time;
    _aligned_free(lck);
}

I've gotten the following results measured in msecs on a processor with 2 physical cores able to run 4 threads (Ivy Bridge).

           1 thread    2 threads     4 threads
8-bit      200         700           3200
16-bit     200         500           1400
32-bit     200         900           3400

The data suggests that all functions take an equal amount of time to execute. But when multiple threads have to check if lck == 0 using a 16-bit can be significantly faster. Why is that? I do not suppose it has something to do with the alignment of the lck?

Thanks in advance.

'I know this isn't a lot of difference but as a spinlock is a heavily used object' - haven't explicitly used a single one in 30+ years of multithreaded software development. — Martin James, Aug 15 '12 at 23:23
Try moving the `pause` instruction INSIDE the spin loop rather than outside the loop. 16 bit instuctions require extra 0x66/0x67 prefix bytes making them slightly larger/slower than 8 or 32 bit instructions. So it may be the extra overhead is slowing down the loop enough to reduce contention in the 16-bit case. — Chris Dodd, Aug 16 '12 at 00:41
I wouldn't suprised if these locks lead to random corruption, as they modify ebx (a callee save register) without saving and restoring it, which might corrupt some value that a caller expects to be preserved. Use edx instead. — Chris Dodd, Aug 16 '12 at 00:46
@ChrisDodd: Since this is no naked assembly method, EBX register is saved and restored by the compiler in surrounding code. — bkausbk, Aug 24 '12 at 13:02
I strongly recommend using intrinsics to do things like pause and cmpxchg. Inline assembly is not the way to go. And yes, the pause instruction is definitely supposed to be inside the loop. It pauses the thread for 10 cycles or so (which boosts other thread for hyperthreading) and it ALSO warns the CPU that a memory ordering violation is likely, and saves the CPU having to restart things when it detects a memory ordering violation at retirement and has to flush/restart instruction. — doug65536, Dec 18 '12 at 23:21

score 2 · Accepted Answer · answered Dec 23 '12 at 12:14

From what I recall the lock works on a word (2 bytes). It was written that way when first introduced in the 486.

If you carry a lock on a different size, it actually generate the equivalent of 2 locks (lock word A and word B for a double word.) For a byte it probably has to prevent the locking of the second byte, which is somewhat similar to 2 locks...

So your results are in line with the CPU optimizations.

score 2 · Answer 2 · answered Dec 23 '12 at 13:11

Imagine there's 1234 threads and 16 CPUs. One thread acquires the spinlock, then the OS does a task switch. Now you've 16 CPUs, each running one of the remaining 1233 threads, all spinning in a remarkably pointless way for however long it takes for the OS to give CPU time back to the only thread that can release the spinlock. This means the entire OS can basically lock up (with all CPUs going flat out) for a few seconds. This is seriously retarded; so how do you fix it?

You fix it by not using spinlocks in user-space. Spinlocks should only ever be used if/when task switches can be disabled; and only the kernel should be able to disable task switches.

More specifically, you need to use a mutex. Now the mutex may spin initially before giving up and making the thread wait for the lock, and (for typical/low contention cases) this does help, but it'd still be a mutex and is not a spinlock.

Next; for sane software, what matters (for performance) is avoiding lock contention, and then making sure that the uncontended case is fast (and a good mutex won't cause a task switch if there's no contention). You are measuring the contended/irrelevant case.

Finally; your lock is bad. To avoid excessive use of the lock prefix you should test if you might be able to acquire without any lock prefix, and only when you might be able to acquire should you use the lock prefix. Intel (and probably lots of other people) call this strategy "test; then (test and set)". In addition you've failed to understand the purpose of pause (or "rep nop" for assemblers that are so bad that they don't support 10 year old instructions).

A half decent spinlock might look something like:

acquire:
    lock bts dword [myLock],0   ;Optimistically attempt to acquire
    jnc .acquired               ;It was acquired!
.retry:
    pause
    cmp dword [myLock],0        ;Should we attempt to acquire again?
    jne .retry                  ; no, don't use `lock`
    lock bts dword [myLock],0   ;Attempt to acquire
    jc .retry                   ;It wasn't acquired, so go back to waiting
.acquired:
    ret

release:
    mov dword [myLock],0        ;No lock prefix needed here as "myLock" is aligned
    ret

Also note that if you've failed to adequately minimise the chance of lock contention, then you do need to care about "fairness" and should not be using a spinlock. The problem with "unfair" spinlocks is that some tasks might be lucky and always get the lock, and some tasks may be unlucky and never get the lock because the lucky tasks have always got it. This has always been a problem for heavily contended locks, but for modern NUMA systems it's become a much more likely problem. In this case, at a minimum you should be using a ticket lock.

The basic idea of a ticket lock is to ensure that tasks acquire the lock in the order they arrive (and not some "possibly extremely bad" random order). For completeness, a ticket lock might look like this:

acquire:
    mov eax,1
    lock xadd [myLock],eax           ;myTicket = currentTicket, currentTicket++

    cmp [myLock+4],eax               ;Is it my turn?
    je .acquired                     ; yes
.retry:
    pause
    cmp [myLock+4],eax               ;Is it my turn?
    jne .retry                       ; no, wait
.acquired:
    ret

release:
    lock inc dword [myLock+4]
    ret

tl;dr; You shouldn't be using the wrong tool for the job (spinlocks) to begin with; but if you insist on using the wrong tool then at least get the wrong tool implemented properly... :-)

Note that the only way to properly implement a mutex is by using a spinlock unless you want the kernel to allow mutexes only when you do task switching (and assuming that all threads are stopped when that happens.) I can tell you that in Linux mutexes are using a spinlock. — Alexis Wilke, May 30 '14 at 21:52

cmpxchg for WORD faster than for BYTE

2 Answers2