26

I'm writing a multithreaded application in C++ where performance is critical. I need to do a lot of locking while copying small structures between threads, and for this I have chosen to use spinlocks.

I have done some research and speed testing on this, and I found that most implementations are roughly equally fast:

  • Microsoft's CRITICAL_SECTION, with SpinCount set to 1000, scores about 140 time units
  • Implementing this algorithm with Microsoft's InterlockedCompareExchange scores about 95 time units (see the sketch after this list)
  • I've also tried some inline assembly with __asm {}, using something like this code, and it scores about 70 time units, but I am not sure that a proper memory barrier has been created.
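
For reference, here is a minimal sketch of the second approach (illustrative names, not my exact benchmark code):

#include <windows.h>

volatile LONG lock_flag = 0;            // 0 = free, 1 = held

inline void spin_lock ()
{
    // Atomically set lock_flag to 1 if it is currently 0; retry otherwise.
    while (InterlockedCompareExchange (&lock_flag, 1, 0) != 0)
        YieldProcessor ();              // emits the PAUSE instruction
}

inline void spin_unlock ()
{
    // InterlockedExchange is a full barrier; a plain store would also work
    // on x86, but this keeps the example conservative.
    InterlockedExchange (&lock_flag, 0);
}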

Edit: The times given here are the time it takes for 2 threads to lock and unlock the spinlock 1,000,000 times.

I know this isn't a lot of difference, but as a spinlock is a heavily used object, one would think that programmers would have agreed on the fastest possible way to make one. Googling it leads to many different approaches, however. I would think the aforementioned method would be the fastest if implemented using inline assembly and the instruction CMPXCHG8B instead of comparing 32-bit registers. Furthermore, memory barriers must be taken into account; this could be done with LOCK CMPXCHG8B (I think?), which guarantees "exclusive rights" to the shared memory between cores. Finally, some suggest that busy waits should be accompanied by REP NOP, which would enable Hyper-Threading processors to switch to another thread, but I am not sure whether this is true or not.

From my performance test of different spinlocks, it is seen that there is not much difference, but for purely academic purposes I would like to know which one is fastest. However, as I have extremely limited experience with assembly language and memory barriers, I would be happy if someone could write the assembly code for the last example I provided, with LOCK CMPXCHG8B and proper memory barriers, in the following template:

__asm
{
     spin_lock:
         ;locking code.
     spin_unlock:
         ;unlocking code.
}
sigvardsen
  • +1 for giving good sources and info before asking. I think you gave more than you need. Thx – huseyin tugrul buyukisik Aug 14 '12 at 19:29
  • How much exactly is a lot? It would need to be an awful damn lot to worry about how fast you can spin. You're sure there is no better way you can restrict access here? Remember also that the speed you're spinning at doesn't affect when you actually acquire the lock. It doesn't matter how fast you're spinning; the other guy has to unlock it first. Consider looping over a yield() to pass execution to another running thread or process if it turns out that you're going to be spinning for a long time. – Wug Aug 14 '12 at 19:36
  • `rep nop` aka `pause` also makes P4 not do retarded things when you leave the spin loop. Intel's manual explicitly recommends its use in spin-wait loops. Are you allowed to use `XACQUIRE` and `XRELEASE` (not available until Haswell)? – harold Aug 14 '12 at 19:38
  • @Wug The times given in the performance tests are the time it takes 2 threads to simultaneously lock, copy 4 ints (to add realism) and unlock the spinlock maybe 10,000,000 times (I don't have the source code on this computer). The time units given do not give any information about how many loops have been run. – sigvardsen Aug 14 '12 at 19:42
  • When you want performance, use lock-free/contention-free data structures on your fast path, and only locks on your slow path – PlasmaHH Aug 14 '12 at 19:49
  • @PlasmaHH I know that you can copy data, but only up to 32 bits on 32-bit machines and 64 bits on 64-bit machines with Microsoft's interlocked operations. But I am not aware that you can do atomic reads and writes on structures larger than that. Could you point me to some reading/literature or examples? Can it be done in assembly? – sigvardsen Aug 14 '12 at 19:56
  • @sigvardsen: If you explained more of what you're doing, we may be able to give alternative solutions. – GManNickG Aug 14 '12 at 20:06
  • @GManNickG I need to copy a structure that consists of 8 ints = 256 bits from one thread to another - atomically, of course. – sigvardsen Aug 14 '12 at 20:14
  • @sigvardsen what if you atomically copy a *pointer* to that data? – harold Aug 14 '12 at 20:17
  • @sigvardsen: You're missing the point. *Why* are you doing that? What problem are you solving? – GManNickG Aug 14 '12 at 20:24
  • I must say that it's hard to believe that an algorithm's bottleneck is performance of spinlocks. Maybe you are overusing them? Maybe you don't need to lock that much? Maybe improving the algorithm design will make this optimization completely unnecessary? Maybe you can sacrifice more memory for the sake of performance? Of course it's perfectly possible that you could have finally ended up with shitty spinlocks but as a rule of thumb: "it's always your fault". http://www.codinghorror.com/blog/2008/03/the-first-rule-of-programming-its-always-your-fault.html – Sedat Kapanoglu Aug 14 '12 at 20:48
  • @ssg As I mention in my question, I realize that all these implementations of spinlocks are almost equally fast. But I raised this question merely out of academic interest. It is good practice to implement the best algorithm even when it is unnecessary. – sigvardsen Aug 14 '12 at 20:56
  • Memory barriers have nothing to do with "exclusive rights". It's about memory ordering; make sure to read the Intel docs about memory ordering on x86, which makes barriers unnecessary for most cases. – Gunther Piez Aug 14 '12 at 21:41
  • @sigvardsen Ok I missed the "purely academic purpose" part, my mistake :) – Sedat Kapanoglu Aug 14 '12 at 22:37
  • @sigvardsen: on x86_64 you have 16-byte atomic swaps; these are usually enough for almost all lock-free data structures. Lots of those structures even work (partly) without lock prefixes, just using the fact that reads/writes of properly aligned things are atomic, sometimes depending on memory barriers too. There is really lots of material to read on the nets. – PlasmaHH Aug 15 '12 at 09:09

5 Answers

10

Although there is already an accepted answer, there are a few things that were missed and that could be used to improve all the answers, taken from this Intel article, all beyond a basic fast-lock implementation:

  1. Spin on a volatile read, not an atomic instruction; this avoids unneeded bus locking, especially on highly contended locks.
  2. Use back-off for highly contended locks.
  3. Inline the lock, preferably with intrinsics for compilers where inline asm is detrimental (basically MSVC).
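
As an illustration, here is a minimal sketch of points 1 and 2 using C++11 atomics (the class and constants are made up for the example):

#include <atomic>
#include <immintrin.h>   // _mm_pause

// Test-and-test-and-set spinlock with exponential back-off.
class ttas_spinlock
{
    std::atomic<int> state;             // 0 = free, 1 = held
public:
    ttas_spinlock () : state (0) {}

    void lock ()
    {
        unsigned backoff = 1;
        for (;;)
        {
            // Point 1: spin on a plain read; no bus-locked RMW until the
            // lock actually looks free.
            while (state.load (std::memory_order_relaxed) != 0)
                for (unsigned i = 0; i < backoff; ++i)
                    _mm_pause ();

            if (state.exchange (1, std::memory_order_acquire) == 0)
                return;                 // acquired

            // Point 2: back off after a failed attempt to reduce contention.
            if (backoff < 1024)
                backoff <<= 1;
        }
    }

    void unlock ()
    {
        state.store (0, std::memory_order_release);
    }
};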
Necrolis
8

I'm usually not one to gripe about someone striving to achieve fast code: it's usually a very good exercise which results in better understanding of programming and faster code.

I won't gripe here either, but I can state unequivocally that the question of whether a fast spinlock is 3 instructions long or a few more is - at least on the x86 architecture - a futile chase.

Here's why:

Invoking a spinlock with a typical code sequence:

lock_variable DW 0    ; 0 <=> free

mov ebx,offset lock_variable
mov eax,1
xchg eax,[ebx]

; if eax contains 0 (no one owned it) you own the lock,
; if eax contains 1 (someone already does) you don't

Freeing a spinlock is trivial:

mov ebx,offset lock_variable
mov dword ptr [ebx],0

The xchg instruction raises the LOCK pin on the processor, which in effect means "I want the bus during the next few clock cycles." This signal weaves its way through the caches and down to the slowest bus-mastering device, which is usually the PCI bus. When every bus-mastering device has completed, the locka (lock acknowledge) signal is sent back. Only then does the actual exchange take place. The problem is that the lock/locka sequence takes a VERY long time: the PCI bus may run at 33 MHz with several cycles of latency, and on a 3.3 GHz CPU that means each PCI bus cycle takes one hundred CPU cycles.

As a rule of thumb I assume that a lock will take between 300 and 3,000 CPU cycles to complete, and in the end I don't know if I'll even own the lock. So the few cycles you can save with a "fast" spinlock will be a mirage, because no lock is like the next: it will depend on your bus situation during that short time.

________________EDIT________________

I just read that the spinlock is a "heavily used object." Well, you obviously don't understand that a spinlock consumes an enormous amount of CPU cycles every time it is invoked. Or, to put it another way, every time you invoke it you lose a significant amount of your processing capability.

The trick when using spinlocks (or their bigger sibling, the critical section) is to use them as sparingly as possible while still achieving the intended program function. Using them all over the place is easy and you'll end up with lackluster performance as a result.

It's not all about writing fast code; it's also about organizing your data. When you write "copying small structures between threads" you should realize that the lock may take hundreds of times longer to complete than the actual copying.

________________EDIT________________

When you compute an average lock time it will probably say very little, as it is measured on your machine, which may not be the intended target (and which may have entirely different bus-usage characteristics). For your machine the average will be made up of individual very fast times (when the bus-mastering activities didn't interfere) all the way up to very slow times (when bus-mastering interference was significant).

You could introduce code which determines the fastest and slowest cases and calculate the quotient to see just how greatly the spinlock times can vary.
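
For example, a sketch of such a measurement, reusing CPU_rdtsc and the lock helpers from the program further down (illustrative, not part of the original test run):

u8 fastest = ~0ull, slowest = 0ull, t;
for (ud i = 0u; i < 1000000u; ++i)
{
  t = CPU_rdtsc ();
  while (LOCK_spinlockFailed (&lockVariable)) {}
  LOCK_spinlockLeave (&lockVariable);
  t = CPU_rdtsc () - t;
  if (t < fastest) fastest = t;     /* best case: no interference */
  if (t > slowest) slowest = t;     /* worst case: heavy bus activity */
}
printf ("fastest:%llu slowest:%llu quotient:%.1f\n",
        fastest, slowest, (f8)slowest/(f8)fastest);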

________________EDIT________________

May 2016 update.

Peter Cordes promoted the idea that "tuning a lock in the un-contended case can make sense" and that lock times of many hundreds of clock cycles do not occur on modern CPUs except in situations where the lock variable is misaligned. I began wondering if my previous test program - written in 32-bit Watcom C - might be hampered by WOW64 as it was running on a 64-bit OS: Windows 7.

So I wrote a 64-bit program and compiled it with TDM's gcc 5.3. The program uses the implicitly bus-locking instruction variant "XCHG r,m" for locking and a simple assignment "MOV m,r" for unlocking. In some lock variants the lock variable is pre-tested to determine whether it is feasible to even attempt a lock (using a simple compare, "CMP r,m", which probably doesn't venture outside L3). Here it is:

// compiler flags used:

// -O1 -m64 -mthreads -mtune=k8 -march=k8 -fwhole-program -freorder-blocks -fschedule-insns -falign-functions=32 -g3 -Wall -c -fmessage-length=0

#define CLASSIC_BUS_LOCK
#define WHILE_PRETEST
//#define SINGLE_THREAD

typedef unsigned char      u1;
typedef unsigned short     u2;
typedef unsigned long      u4;
typedef unsigned int       ud;
typedef unsigned long long u8;
typedef   signed char      i1;
typedef          short     i2;
typedef          long      i4;
typedef          int       id;
typedef          long long i8;
typedef          float     f4;
typedef          double    f8;

#define usizeof(a) ((ud)sizeof(a))

#define LOOPS 25000000

#include <stdio.h>
#include <string.h>   /* for memset */
#include <windows.h>

#ifndef bool
typedef signed char bool;
#endif

u8 CPU_rdtsc (void)
{
  ud tickl, tickh;
  __asm__ __volatile__("rdtsc":"=a"(tickl),"=d"(tickh));
  return ((u8)tickh << 32)|tickl;
}

volatile u8 bus_lock (volatile u8 * block, u8 value)
{
  __asm__ __volatile__( "xchgq %1,%0" : "=r" (value) : "m" (*block), "0" (value) : "memory");

  return value;
}

void bus_unlock (volatile u8 * block, u8 value)
{
  __asm__ __volatile__( "movq %0,%1" : "=r" (value) : "m" (*block), "0" (value) : "memory");
}

void rfence (void)
{
  __asm__ __volatile__( "lfence" : : : "memory");
}

void rwfence (void)
{
  __asm__ __volatile__( "mfence" : : : "memory");
}

void wfence (void)
{
  __asm__ __volatile__( "sfence" : : : "memory");
}

volatile bool LOCK_spinlockPreTestIfFree (const volatile u8 *lockVariablePointer)
{
  return (bool)(*lockVariablePointer == 0ull);
}

volatile bool LOCK_spinlockFailed (volatile u8 *lockVariablePointer)
{
  return (bool)(bus_lock (lockVariablePointer, 1ull) != 0ull);
}

void LOCK_spinlockLeave (volatile u8 *lockVariablePointer)
{
  *lockVariablePointer = 0ull;
}

static volatile u8 lockVariable = 0ull,
                   lockCounter =  0ull;

static volatile i8 threadHold = 1;

static u8 tstr[4][32];    /* 32*8=256 bytes for each thread's parameters should result in them residing in different cache lines */

struct LOCKING_THREAD_STRUCTURE
{
  u8 numberOfFailures, numberOfPreTests;
  f8 clocksPerLock, failuresPerLock, preTestsPerLock;
  u8 threadId;
  HANDLE threadHandle;
  ud idx;
} *lts[4] = {(void *)tstr[0], (void *)tstr[1], (void *)tstr[2], (void *)tstr[3]};

DWORD WINAPI locking_thread (struct LOCKING_THREAD_STRUCTURE *ltsp)
{
  ud n = LOOPS;
  u8 clockCycles;

  SetThreadAffinityMask (ltsp->threadHandle, 1ull<<ltsp->idx);

  while (threadHold) {}

  clockCycles = CPU_rdtsc ();
  while (n)
  {
    Sleep (0);

#ifdef CLASSIC_BUS_LOCK
    while (LOCK_spinlockFailed (&lockVariable)) {++ltsp->numberOfFailures;}
#else
#ifdef WHILE_PRETEST
    while (1)
    {
      do
      {
        ++ltsp->numberOfPreTests;
      } while (!LOCK_spinlockPreTestIfFree (&lockVariable));

      if (!LOCK_spinlockFailed (&lockVariable)) break;
      ++ltsp->numberOfFailures;
    }
#else
    while (1)
    {
      ++ltsp->numberOfPreTests;
      if (LOCK_spinlockPreTestIfFree (&lockVariable))
      {
        if (!LOCK_spinlockFailed (&lockVariable)) break;
        ++ltsp->numberOfFailures;
      }
    }
#endif
#endif
    ++lockCounter;
    LOCK_spinlockLeave (&lockVariable);

#ifdef CLASSIC_BUS_LOCK
    while (LOCK_spinlockFailed (&lockVariable)) {++ltsp->numberOfFailures;}
#else
#ifdef WHILE_PRETEST
    while (1)
    {
      do
      {
        ++ltsp->numberOfPreTests;
      } while (!LOCK_spinlockPreTestIfFree (&lockVariable));

      if (!LOCK_spinlockFailed (&lockVariable)) break;
      ++ltsp->numberOfFailures;
    }
#else
    while (1)
    {
      ++ltsp->numberOfPreTests;
      if (LOCK_spinlockPreTestIfFree (&lockVariable))
      {
        if (!LOCK_spinlockFailed (&lockVariable)) break;
        ++ltsp->numberOfFailures;
      }
    }
#endif
#endif
    --lockCounter;
    LOCK_spinlockLeave (&lockVariable);

    n-=2;
  }
  clockCycles = CPU_rdtsc ()-clockCycles;

  ltsp->clocksPerLock =   (f8)clockCycles/           (f8)LOOPS;
  ltsp->failuresPerLock = (f8)ltsp->numberOfFailures/(f8)LOOPS;
  ltsp->preTestsPerLock = (f8)ltsp->numberOfPreTests/(f8)LOOPS;

//rwfence ();

  ltsp->idx = 4u;

  ExitThread (0);
  return 0;
}

int main (int argc, char *argv[])
{
  u8 processAffinityMask, systemAffinityMask;

  memset (tstr, 0u, usizeof(tstr));

  lts[0]->idx = 3;
  lts[1]->idx = 2;
  lts[2]->idx = 1;
  lts[3]->idx = 0;

  GetProcessAffinityMask (GetCurrentProcess(), &processAffinityMask, &systemAffinityMask);

  SetPriorityClass (GetCurrentProcess(), HIGH_PRIORITY_CLASS);
  SetThreadAffinityMask (GetCurrentThread (), 1ull);

  lts[0]->threadHandle = CreateThread (NULL, 65536u, (void *)locking_thread, (void *)lts[0], 0, (void *)&lts[0]->threadId);
#ifndef SINGLE_THREAD
  lts[1]->threadHandle = CreateThread (NULL, 65536u, (void *)locking_thread, (void *)lts[1], 0, (void *)&lts[1]->threadId);
  lts[2]->threadHandle = CreateThread (NULL, 65536u, (void *)locking_thread, (void *)lts[2], 0, (void *)&lts[2]->threadId);
  lts[3]->threadHandle = CreateThread (NULL, 65536u, (void *)locking_thread, (void *)lts[3], 0, (void *)&lts[3]->threadId);
#endif

  SetThreadAffinityMask (GetCurrentThread (), processAffinityMask);

  threadHold = 0;

#ifdef SINGLE_THREAD
  while (lts[0]->idx<4u) {Sleep (1);}
#else
  while (lts[0]->idx+lts[1]->idx+lts[2]->idx+lts[3]->idx<16u) {Sleep (1);}
#endif

  printf ("T0:%1.1f,%1.1f,%1.1f\n", lts[0]->clocksPerLock, lts[0]->failuresPerLock, lts[0]->preTestsPerLock);
  printf ("T1:%1.1f,%1.1f,%1.1f\n", lts[1]->clocksPerLock, lts[1]->failuresPerLock, lts[1]->preTestsPerLock);
  printf ("T2:%1.1f,%1.1f,%1.1f\n", lts[2]->clocksPerLock, lts[2]->failuresPerLock, lts[2]->preTestsPerLock);
  printf ("T3:%1.1f,%1.1f,%1.1f\n", lts[3]->clocksPerLock, lts[3]->failuresPerLock, lts[3]->preTestsPerLock);

  printf ("T*:%1.1f,%1.1f,%1.1f\n", (lts[0]->clocksPerLock+  lts[1]->clocksPerLock+  lts[2]->clocksPerLock+  lts[3]->clocksPerLock)/  4.,
                                    (lts[0]->failuresPerLock+lts[1]->failuresPerLock+lts[2]->failuresPerLock+lts[3]->failuresPerLock)/4.,
                                    (lts[0]->preTestsPerLock+lts[1]->preTestsPerLock+lts[2]->preTestsPerLock+lts[3]->preTestsPerLock)/4.);

  printf ("LC:%u\n", (ud)lockCounter);

  return 0;
}

The program was run on a DELL computer based on an i5-4310U with DDR3-800, 2 cores each with 2 HTs (four logical processors), capable of 2.7 GHz and with a common L3 cache.

To begin with, it appears the impact of WOW64 was negligible.

A single thread performing uncontended lock/unlocks was able to do so once every 110 cycles. Tuning the uncontended lock is useless: any code added to enhance the single XCHG instruction will only make it slower.

With four HTs bombarding the lock variable with lock attempts, the situation changes radically. The time required to achieve a successful lock jumps to 994 cycles, of which a significant part can be attributed to the 2.2 failed lock attempts. To put it another way, in a high-contention situation an average of 3.2 locks must be attempted to achieve one successful lock. Obviously the 110 cycles have not become 110*3.2 but closer to 110*9, so other mechanisms are at play here, just as with the tests on the older machine. Also, the average of 994 cycles encompasses a range between 716 and 1157 cycles.

The lock variants implementing pre-testing required about 95% of the cycles required by the simplest variant (XCHG). On average they would perform 17 CMPs to find it feasible to attempt 1.75 locks, of which 1 was successful. I recommend using pre-testing not only because it is faster: it also imposes less strain on the bus-locking mechanism (3.2-1.75=1.45 fewer lock attempts), even though it increases the complexity slightly.

Olof Forshell
  • `xchg` with a memory operand, and other `lock`ed instructions, aren't anywhere near *that* slow on modern x86 CPUs. They're coherent with cache, but not with DMA, so they don't have to wait for a DRAM access. (And definitely not a PCI bus cycle, unless maybe you use it on a a memory-mapped-IO address!). e.g. on Sandybridge (which was current in 2012), `xchg` with memory has a latency of 25 cycles (on a cache hit), according to Agner Fog's tables. If the current core has the cache line in the Exclusive state, then the atomic RMW is easy, and it's the memory-barrier part that takes the time. – Peter Cordes Apr 17 '16 at 00:23
  • Worst-case for `xchg` is if another core has the cache line in Modified or Exclusive state with `lock`ed instructions in flight, or maybe if no core has that line cached, so it has to read from main memory. Note that it *doesn't* have to *write* back to main memory; you'll end up with the cache line in the M state in the current core's L1. – Peter Cordes Apr 17 '16 at 00:25
  • @Peter Cordes: So why does AF write - on the bottom of page 1 of instruction_tables.pdf - "A LOCK prefix typically costs more than a hundred clock cycles, even on a single processor system. This also applies to the XCHG instruction with a memory operand."? I guess Agner Fog could be considered being both right and wrong ... which is it? – Olof Forshell Apr 18 '16 at 21:20
  • Best-case is about 25c for an un-contended `xchg` with a cache-line that's already hot in L1, but typical case of 100c sounds reasonable when you consider that it orders other stores with loads, and you probably won't use it when there's no contention. I just timed a tiny loop myself, and even on a CPU as old as Merom/Conroe Core2, I got one per ~25c throughput. (Multiple `xchg` to different addresses still can't parallelize, because of the memory-barrier effect). Anyway, it definitely doesn't have to do PCIe bus cycles when used on normal memory regions. – Peter Cordes Apr 19 '16 at 01:02
  • Big picture: tuning a lock to be fast in the un-contended case *can* make sense, because that can be quite low overhead. This happens in the Linux kernel quite frequently, IIRC: the common case is no contention, but locking is still needed. – Peter Cordes Apr 19 '16 at 01:05
  • You are sure about this? You have personal experience? I do http://stackoverflow.com/questions/4912467/c-c-unsynchronized-threads-returning-a-strange-result/4921880#4921880 – Olof Forshell Apr 19 '16 at 18:12
  • Interesting numbers, thanks for the link. No, I don't have personal experience with tuning/measuring locking overhead. Your answer there proves that `xchg` isn't as slow as it would be if it had to interact with the PCI bus, though. You're also testing a loop that's completely bottlenecked on lock throughput, so there's no other work in the pipeline for the CPU to be doing while the `xchg` is happening. I think, but haven't tested, that `xchg` pipelines pretty well, so in real code that has lots of work to do besides locking, the uncontended-case overhead isn't terrible. – Peter Cordes Apr 19 '16 at 18:43
  • The most efficient locks are the ones you manage to avoid - if you're skilled enough. Newbies and people who don't understand them will think they are without significant cost and use them much more frequently than others, in the same way that most people writing multi-threaded code will end up with a slower application. In the end, your code might have to send the LOCK signal out on the bus and wait for the slowest bus-mastering device to return the LOCKA to say that the lock is in place. Or do you claim such situations cannot occur? What if your lock variable has been evicted from the cache? – Olof Forshell Apr 19 '16 at 21:13
  • On Intel Core2 (E6600), a `clflush [rsp-128]` / `xchg [rsp-128], eax` loop is the same speed as a `clfush [rsp-128]` / `add [rsp-128], eax` loop. (~410c per iteration at 2.4GHz, with DDR533 DRAM. `clflush` alone runs ~86c per iteration.) So the implicit `lock` in the `xchg` doesn't make the round trip to DRAM any worse. It's obviously still extremely expensive in this scenario. I'm not trying to claim that locking in general is cheap, and your point that good design to avoid locking is a great one. – Peter Cordes Apr 19 '16 at 21:35
  • I don't think locked insns have to wait for PCI devices. [x86 has cache-coherent DMA](https://lwn.net/Articles/2265/). This may be a recent thing, enabled by putting the memory controller on the CPU, making it easier for the cache to snoop DMA. I'd remove my downvote, and maybe upvote, if you took out (or provide evidence for) that part about external buses. – Peter Cordes Apr 19 '16 at 21:41
  • My answer from 2011 was excuted on a dual-core processor released in September, 2009. That processor had two L2 caches - one for each core and no (common) L3. The example with two locking threads would have required lots of round trips to RAM to perform a loop (read or read+write to lock, read+write to update num, write to unlock). A successful loop completion would have caused the other thread's cached information to become invalidated, forcing it to start from scratch. I will atempt to run the code on a system with an L3. – Olof Forshell Apr 21 '16 at 14:42
  • My single-threaded clflush tests were on an even older CPU (first-gen Core2: large per-core L2 caches), since my Sandybridge motherboard failed and I haven't got a replacement yet. My prediction is lower access times because it's only a round trip to L3 instead of DRAM when the other thread has invalidated the cache line in the first core's L1/L2. – Peter Cordes Apr 22 '16 at 23:50
  • This document http://support.amd.com/TechDocs/47414_15h_sw_opt_guide.pdf shows the unlikely case where an xchg instruction will result in a bus lock on a modern architecture. Published in 2014 it appears to still be in effect. – Olof Forshell Apr 24 '16 at 10:22
  • Are you talking about section 11.1.4? They say "In some hardware implementations", which I think means they're talking about CPUs other than AMD Family 15h (Bulldozer-family). It's possible that even Bulldozer falls back to a bus lock for `lock`ed insn that span two cache lines. Thanks for digging that up; interesting note. It almost certainly won't happen on an unaligned op that doesn't cross any cache-line boundaries, though. Bulldozer has efficient unaligned access support for AVX and integer ops, so they probably manage to avoid any horrible fallbacks for unaligned within a cache line. – Peter Cordes Apr 24 '16 at 10:45
  • It's not the op that needs to be aligned, it's the lock variable itself. In any case i hesitate to categorize the xchg (w lock) as an op like any other: it has specific, system-affecting properties that an ordinary integer op does not. – Olof Forshell Apr 24 '16 at 19:40
  • About your test (4 messages up) did you try it with two threads tied to separate cores? My test was two threads on two cores (affinity), 32-bit code running on 64-bit W7. – Olof Forshell Apr 24 '16 at 19:42
  • No, I didn't try with two threads. I was intentionally trying a different approach to measuring main memory latency, but I guess it would be more useful to have contended-lock numbers from the same hardware. re: alignment of the lock: yes, that's what I meant. i.e. When people say "aligned load", they mean a load from an aligned address. So "aligned op" means a locked operation on an aligned address, i.e. the lock variable is aligned. My guess (which could easily be wrong) about unaligned lock vars that don't cross cache lines is based on the idea that the CPU probably locks a whole line. – Peter Cordes Apr 25 '16 at 06:42
  • The individual cores are not aware of cache lines. The cache mechanism loads cache lines with memory requested by the core to mitigate the slow response time of RAM. On today's x86s a cache line is 64-bytes long aligned onto an even 64-byte RAM address. Even if you only want the 43rd byte in a cache line the mechanism loads all 64 bytes. If a location in a cache line is modified the mechanism writes that cache line to memory. In the same manner the core is unaware that exchange has an explicit lock: it executes the instruction, wait states are inserted and finally the core sees the data. – Olof Forshell May 27 '16 at 11:28
  • Private L1 cache is part each core. AMD's optimization guide which you linked earlier (pg 202: 11.1.4 ... "Cache Line Boundary") explicitly refers to a core locking its own cache instead of taking a bus lock. They say "naturally aligned", but the title of the section says "cache line boundary". Also, what part of the core doesn't need to know that `xchg` has an implicit lock? It has to trigger a memory barrier (like MFENCE) as well itself being atomic. Your description sounds like what might happen on a simple in-order CPU. – Peter Cordes May 27 '16 at 12:21
  • Nice update, interesting numbers. However, you're only measuring locking throughput, not the impact of locking in a workload that has lots of non-locking work in the pipeline. *That's* the case where I was suggesting it's worth avoiding slowdowns in the un-contended case. It's similar to measuring `divps` slowness with a loop that *only* runs `divps`, vs. a loop that does many dozens of `addps` / `mulps` operations and one `divps`. As long as there is instruction-level parallelism available, `divps` should have low impact on total throughput, since it's just a single port0 uop. – Peter Cordes May 27 '16 at 12:52
  • Obviously `xchg` has much more impact on the pipeline than `divps`, even when it's not part of the critical path dependency chain. But it can still potentially overlap with other instructions. Certainly with non-memory operations. – Peter Cordes May 27 '16 at 13:00
  • The OP still wrote: "I need to use a lot of locking while copying small structures between threads, for this I have chosen to use spinlocks". Although my code shows it is possible for each of four cores to achieve 2.7 million locks/unlocks per second, it is not inconceivable that the OP uses so many of them that his multi-threaded code would be better off single-threaded without locks. – Olof Forshell May 28 '16 at 23:48
  • Please cut the theoretical argumentation. I've presented real numbers on real machines and you poke at them because you think this or that reasoning is or isn't logical because you have read something somewhere. Do some experimentation of your own instead of guessing. You can start by using my program. It shows the examples I have tested so you don't have to ask me what I have or haven't tested. – Olof Forshell May 28 '16 at 23:52
  • One of the best and most thorough answers I've ever read, thank you! – Gukki5 Sep 01 '20 at 10:13
  • TL;DR: sell your soul to before ever using any lock of any kind, no matter how it's implemented, unless you absolutely MUST. – Gukki5 Sep 01 '20 at 10:24
  • @OlofForshell: I finally got around to testing my claim about *uncontended* `xchg` scaling well, i.e. if each thread is taking/releasing a different lock (in a different cache line). https://godbolt.org/z/MEjboaqMf is C++ code that compiles to a loop that just repeats an `xchg [rdi], ecx`. The total runtime of the program on my i7-6700k is 463ms whether it runs on 1 or 4 threads (with different `atomic` objects), so that rules out any kind of system-wide bus lock, confirming that it's just a cache-lock ([Can num++ be atomic for 'int num'?](https://stackoverflow.com/q/39393850)) – Peter Cordes Mar 29 '21 at 18:53
  • Of course threads contending for the *same* lock slow each other down, and everything you say about minimizing locking in general is good, but the one thing that's not horribly slow is for a thread to release and re-acquire a lock if no other threads want it. – Peter Cordes Mar 29 '21 at 18:53
  • I had earlier suggested that it would be better to check read-only to see if the lock is available before even trying the first time. That's probably not good, and could lead to a share request to get read-only access before an RFO to get exclusive ownership. See discussion in [Does cmpxchg write destination cache line on failure? If not, is it better than xchg for spinlock?](https://stackoverflow.com/q/63008857) - spinning read-only is good after you've already failed once, but probably not the first time. – Peter Cordes Mar 29 '21 at 18:57
  • I do think you need to time them. I suggest you should at the very least see really fast uncontended locks, "less fast" locks that have been slowed by system-wide bus contention, and "slow" times where an interrupt might have occurred. If every run requires 463 milliseconds it could be as you say, or it could be that system-wide contention is rare, which could be the case with infrequent bus-mastering events. The interrupt loading looks fairly constant. If you time events individually you will either see complete homogeneity or you won't, in which case you need to present hypotheses as to why. – Olof Forshell Mar 30 '21 at 19:27
  • Locks can be directly contended by other threads wanting them or indirectly by a system wide bus contention which has nothing to do with the lock itself, it simply says that the bus cannot vouch for a lock at this moment in time being 100% safe due to devices that may overwrite the memory containing the lock if they aren't first allowed to finish what they are doing. This may sound ludicrous but from a system view the bus lock mechanism cannot and should not be privy to what the executing software is doing - it is simply an extremely blunt and costly but absolutely necessary mechanism. – Olof Forshell Mar 30 '21 at 19:39
  • Didn't see your comments since you didn't @user address them to me. A core can perform a "cache lock" to do an atomic RMW any time it has MESI exclusive ownership of a cache line. This exclusive ownership is the same thing that's required to commit normal stores to L1d cache, so from an outside perspective there's no visible sign (on any bus outside a CPU core) that a core did an atomic RMW on an aligned object in normal cacheable memory. Only for cache-line split and maybe also for uncacheable memory would a bus lock be necessary. – Peter Cordes Apr 08 '21 at 03:44
  • Of course, if a core doesn't already have exclusive ownership of a cache line (e.g. after some *other* core just unlocked it), getting exclusive ownership will take maybe 40 to 70 nanoseconds (hundreds of cycles), depending on the interconnect between cores. I think that's what you're describing as "less fast" locks; normally no "bus lock" is involved. x86 has cache-coherent DMA; bus-mastering by a device writing to a physical address doesn't invalidate cache lines other than the one the device is writing to. Bus lock is only needed for rare stuff like line-split locks. – Peter Cordes Apr 08 '21 at 03:50
  • @Peter Cordes: it's really back to basics, I guess. A multi-threaded application has a memory area m which is updated at different times by the application's different threads. To serialize (avoid simultaneous collisions) the updates a spinlock s is implemented. Successfully locking s gives the thread access to m after it unlocks s. An unsuccessful locking attempt means that the lock must be attempted at a later time. The ultimate location of m will be at some address in RAM. This means that m can be updated from two distinct directions: the CPU direction or the peripheral direction. – Olof Forshell Jan 11 '23 at 16:25
  • The multithreading application may be implemented such that m is only updated from the CPU direction and, in that case, the spinlock will only be accessed by one (or more) cores. The application may also implement reading data into m from a peripheral. The xchg r,m instruction is unaware (it actually must not be aware) of how the access to m is implemented by the application, its only job is to ensure that a successful lock has the absolute meaning that the application's context is in a state that the application - from its "point of view" - sees as known. – Olof Forshell Jan 11 '23 at 16:36
  • So if the application expects a known state to mean that no other cores are vying for the spinlock (by extension, m - but xchg r,m does not know this) and that no (perhaps asynchronous) read from a peripheral is outstanding, it indirectly expects xchg r,m to deliver this. That is why xchg r,m by definition cannot only secure the CPU direction: it must also secure the peripheral direction. – Olof Forshell Jan 11 '23 at 16:45
  • It might be feasible to implement a locking instruction solely for the CPU direction for cases when the application won't update m by an asynchronous read from a peripheral. Until this has been implemented, xchg r,m cannot ignore the peripheral direction because it does not know if this is what the application requires of it. – Olof Forshell Jan 11 '23 at 16:52
  • So the "MESI exclusive ownership" argument would be valid if the xchg r,m is implemented to only secure the CPU direction. But while such an implementation would cover the needs of many applications, it would not cover the needs of all of them which is what xchg r,m is all about: providing a one-stop conditional solution to delivering the entire system in a known state to whomever requires it. – Olof Forshell Jan 11 '23 at 17:00
  • So while xchg r,m in the successful case will deliver the entire system in a known state to application and OS alike, it makes no guarantees as to how long it will take. As I have written previously, the fastest spinlock is the one you never even attempt, meaning that if the application is well thought out the need for spinlocks is reduced to a minimum and the risk of an attempt being unsuccessful is minimized. The OP specified an application that was (probably) just about as unsuitable as is possible for using spinlocks. – Olof Forshell Jan 11 '23 at 17:09
  • We don't know if that application was actually an exercise so that the OP could experience first hand just how expensive spinlocks can be. – Olof Forshell Jan 11 '23 at 17:11
  • *That is why xchg r,m by definition cannot only secure the CPU direction: it must also secure the peripheral direction.* - Right, and it does this by keeping MESI exclusive ownership like I described. x86 has cache-coherent DMA, so there's no way for a peripheral to access memory without the CPU knowing about it; the system agent has to get exclusive ownership, so the core that owns it has to write-back first. Or to put it another way, access by peripherals respects the cache-lock that `xchg r,m` uses when the memory operand isn't split across cache lines, so a bus lock isn't needed. – Peter Cordes Jan 12 '23 at 00:59
6

Wikipedia has a good article on spinlocks; here is the x86 implementation:

http://en.wikipedia.org/wiki/Spinlock#Example_implementation

Notice their implementation doesn't use the "lock" prefix, because it is redundant on x86 for the "xchg" instruction - it implicitly has lock semantics, as discussed in this Stack Overflow discussion:

On a multicore x86, is a LOCK necessary as a prefix to XCHG?

REP NOP is an alias for the PAUSE instruction; you can learn more about that here:

How does x86 pause instruction work in spinlock *and* can it be used in other scenarios?
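
Putting the implicitly locked xchg and the pause spin together, a sketch of that style of lock in GCC inline assembly could look like this (illustrative only; the Wikipedia article has the canonical version):

volatile int lock_var = 0;

void spin_lock (void)
{
    int value = 1;
    __asm__ __volatile__ (
        "1: xchgl %0, %1\n\t"           // implicitly locked swap
        "   testl %0, %0\n\t"           // old value 0 => we own the lock
        "   jz 2f\n\t"
        "3: pause\n\t"                  // REP NOP: be kind to the sibling HT
        "   cmpl $0, %1\n\t"            // spin read-only until it looks free
        "   jnz 3b\n\t"
        "   jmp 1b\n\t"
        "2:"
        : "+r" (value), "+m" (lock_var) : : "memory");
}

void spin_unlock (void)
{
    // A plain store suffices on x86 (see the comments below).
    __asm__ __volatile__ ("movl $0, %0" : "=m" (lock_var) : : "memory");
}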

On the issue of memory barriers, here's everything you might want to know:

Memory Barriers: a Hardware View for Software Hackers by Paul E. McKenney

http://irl.cs.ucla.edu/~yingdi/paperreading/whymb.2010.06.07c.pdf

amdn
  • In the code you linked to the Wikipedia implementation, why does the instruction to set the lock value in memory to `0` have to be atomic (using `xchg` instruction)? It would seem that even if another thread accesses the memory location before it's set to `0`, all it would see is that the lock is still locked - wouldn't a direct `mov` to the memory location work too? – VF1 Jan 31 '14 at 20:43
  • Yes, a mov works on new x86 processors but not on some older ones, that is explained in the next section titled "Significant Optimizations" `On later implementations of the x86 architecture, spin_unlock can safely use an unlocked MOV instead of the slower locked XCHG. This is due to subtle memory ordering rules which support this, even though MOV is not a full memory barrier. However, some processors (some Cyrix processors, some revisions of the Intel Pentium Pro (due to bugs), and earlier Pentium and i486 SMP systems) will do the wrong thing and data protected by the lock could be corrupted.` – amdn Jan 31 '14 at 21:42
  • Thanks for pointing that out. If you don't mind me asking, what is the problem avoided by these "subtle memory ordering rules?" – VF1 Jan 31 '14 at 22:29
  • X86 memory ordering requires that stores by one thread be seen by other threads in the same order, which means there is no way another thread can acquire the lock and *not* see stores done in the critical section before the lock was released. See section (4) of this paper http://www.spinroot.com/spin/Doc/course/x86_tso.pdf – amdn Jan 31 '14 at 22:57
  • By the way, I found that the fact a MOV is sufficient (no serialization or memory barrier required) was "officially" clarified in a reply to Linus Torvalds by an Intel architect back in 1999 https://lkml.org/lkml/1999/11/24/90 I guess it was later discovered that this didn't work for some older x86 processors. – amdn Feb 01 '14 at 00:50
3

Just look here: x86 spinlock using cmpxchg

And thanks to Cory Nelson

__asm{
spin_lock:
xorl %ecx, %ecx
incl %ecx
spin_lock_retry:
xorl %eax, %eax
lock; cmpxchgl %ecx, (lock_addr)
jnz spin_lock_retry
ret

spin_unlock:
movl $0, (lock_addr)
ret
}

And another source says: http://www.geoffchappell.com/studies/windows/km/cpu/cx8.htm

       lock    cmpxchg8b qword ptr [esi]
is replaceable with the following sequence

try:
        lock    bts dword ptr [edi],0
        jnb     acquired
wait:
        test    dword ptr [edi],1
        je      try
        pause                   ; if available
        jmp     wait

acquired:
        cmp     eax,[esi]
        jne     fail
        cmp     edx,[esi+4]
        je      exchange

fail:
        mov     eax,[esi]
        mov     edx,[esi+4]
        jmp     done

exchange:
        mov     [esi],ebx
        mov     [esi+4],ecx

done:
        mov     byte ptr [edi],0

And here is a discussion about lock-free vs lock implementations: http://newsgroups.derkeiler.com/Archive/Comp/comp.programming.threads/2011-10/msg00009.html

huseyin tugrul buyukisik
  • Where would you place the PAUSE (REP NOP) instruction? Inside the loop or before? – sigvardsen Aug 15 '12 at 21:40
  • In the place of "pause" at the "wait" part. But for older CPUs, as far as I know – huseyin tugrul buyukisik Aug 16 '12 at 04:09
  • This implementation spins on a `lock cmpxchg` rather than an atomic load before trying the exchange. (See Necrolis's answer). `pause` avoids a memory-ordering mis-speculation when you spin on a load instead of a `lock`ed instruction. See [my answer on another question for a simple/minimal asm spinlock implementation that I think avoids any obvious problems](http://stackoverflow.com/questions/37241553/locks-around-memory-manipulation-via-inline-assembly/37246263#37246263). – Peter Cordes May 27 '16 at 13:41
  • Does this also work for 64-bit? Why does it solely say x86 spinlock? – Goaler444 Feb 24 '17 at 16:57
  • @Goaler444 x86 is the general architecture name; it should also work for x86_64, with a 64-bit lock_addr and cmpxchgq instead of cmpxchgl – huseyin tugrul buyukisik Feb 24 '17 at 18:01
-1

Just asking:

Before you dig that deep into spinlocks and nearly-lockless data structures:

Have you - in your benchmarks and your application - made sure that the competing threads are guaranteed to run on different cores?

If not, you may end up with a program that works great on your development machine but sucks/fails hard in the field, because one thread has to be both the locker and the unlocker of your spinlock.

To give you a figure: on Windows the standard time slice is 10 milliseconds. If you don't make sure that two physical threads are involved in locking/unlocking, you'll end up with around 500 locks/unlocks per second, and this result will be very meh.
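
A sketch of what that pinning could look like with the Win32 API (illustrative; the masks assume at least two cores):

#include <windows.h>

// Pin two benchmark threads to different cores so the spinlock is really
// contended across cores rather than within one time-sliced core.
void pin_threads (HANDLE thread0, HANDLE thread1)
{
    SetThreadAffinityMask (thread0, 1);   // core 0
    SetThreadAffinityMask (thread1, 2);   // core 1
}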

Nils Pipenbrinck
  • Of course one thread has to lock/unlock the spin lock. Or else it won't protect anything. And there's no reason why a thread will acquire the lock, be descheduled, and then unlock once per time slice. – dave Aug 15 '12 at 05:15