8

What are the benefits of using a specifically designed spinlock (e.g. http://anki3d.org/spinlock) vs. code like this:

std::mutex m;
while (!m.try_lock()) {}
// do work
m.unlock();
user2411693
  • Why not just `m.lock();`? What's the point of the busy wait? – Igor Tandetnik Feb 11 '16 at 05:10
  • In this case, I have many idle cores and want to minimize the latency with which each thread starts work. A spinlock felt like what I needed, but I didn't understand why using a mutex for this wasn't common, and I kept seeing different homegrown spinlock implementations instead. – user2411693 Feb 11 '16 at 05:18
  • I suspect that `try_lock()` is about as expensive as `lock()`, and you are getting the worst of both worlds - the cost of a trip to the kernel, and using your CPU as a space heater. – Igor Tandetnik Feb 11 '16 at 05:33
  • So the benefit of a hand-rolled spinlock is that it's guaranteed to spin in userspace? And if `try_lock` requires a system call, then I'll be losing context and paying the latency anyway? – user2411693 Feb 11 '16 at 05:38
  • That's my feeling, yes. Not that I did any performance measurements or anything. – Igor Tandetnik Feb 11 '16 at 05:40
  • @user2411693 If you're going to assume your platform's spinlock implementation is so bad that even the most naive implementation is better, you should assume that about other things your platform provides too and just abandon the platform. – David Schwartz Feb 11 '16 at 06:06

2 Answers

13

On typical hardware, there are massive benefits:

  1. Your naive "fake spinlock" may saturate internal CPU buses while the CPU spins, starving other physical cores including the physical core that holds the lock.

  2. If the CPU supports hyper-threading or something similar, your naive "fake spinlock" may consume excessive execution resources on the physical core, starving another thread sharing that physical core.

  3. Your naive "fake spinlock" probably does extraneous write operations that result in bad cache behavior. When you perform a read-modify-write operation on an x86/x86_64 CPU (like the compare/exchange that try_lock probably does), it always writes even if the value isn't changed. This write causes the cache line to be invalidated on other cores, requiring them to re-share it when another core accesses that line. This is awful if threads on other cores contend for the same lock at the same time.

  4. Your naive "fake spinlock" interacts badly with branch prediction. When you finally do get the lock, you take the mother of all mispredicted branches right at the point where you are locking out other threads and need to execute as quickly as possible. This is like a runner being all pumped up and ready to run at the starting line but then when he hears the starting pistol, he stops to catch his breath.

Basically, that code does everything wrong that it is possible for a spinlock to do wrong. Absolutely nothing is done efficiently. Writing good synchronization primitives requires deep hardware expertise.
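
For contrast, a minimal "test-and-test-and-set" sketch along these lines might look as follows on x86. This is purely illustrative (the class name and the choice of _mm_pause are assumptions, not anyone's library code), and a platform-provided spinlock will still do more (backoff, fairness), but it avoids the four mistakes above:

#include <atomic>
#include <immintrin.h>  // _mm_pause(); x86/x86_64 only

class spinlock {
    std::atomic<bool> locked{false};

public:
    void lock() {
        for (;;) {
            // One read-modify-write attempt; if the lock was free, we now own it.
            if (!locked.exchange(true, std::memory_order_acquire))
                return;
            // Otherwise wait using plain loads only: no write traffic, so the
            // cache line stays shared instead of bouncing between cores (#3).
            while (locked.load(std::memory_order_relaxed)) {
                // PAUSE marks this as a spin-wait loop: it frees execution
                // resources for a hyper-threaded sibling (#1, #2) and avoids
                // the pipeline flush when the loop finally exits (#4).
                _mm_pause();
            }
        }
    }

    void unlock() {
        locked.store(false, std::memory_order_release);
    }
};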

David Schwartz
  • I follow #1, #2, and #4, but I don't understand #3. Could you elaborate a bit? Appreciate how decisively you've pointed out that it's a bad idea. Lesson learned; I will replace all such uses with proper spinlocks. – user2411693 Feb 11 '16 at 06:09
  • When you perform a read-modify-write operation on an x86/x86_64 CPU (like the compare/exchange that try_lock probably does), it always writes even if the value isn't changed. This write causes the cache line to be invalidated for other cores, requiring them to re-share it when another core accesses that line. (Note that all of these may not apply on every platform.) – David Schwartz Feb 11 '16 at 06:12
  • Thanks. That's really helpful. Do you recommend any resources (paper or digital) on profiling and optimizing code with these things in mind? – user2411693 Feb 11 '16 at 06:23
  • A decade of experience. Seriously, I'm not really sure how to learn this stuff other than the hard way. – David Schwartz Feb 11 '16 at 06:36
  • I think a lot of your points (probably all) apply equally to the linked "true" spinlock. – MikeMB Feb 12 '18 at 10:40
  • @MikeMB I hope not. If they do, you should file bug reports! Synchronization primitives should be implemented by platform experts who don't make these kinds of mistakes. They should have a deep understanding of what they're making the CPU do and of the CPU manufacturer's recommended strategies to avoid these problems. (All of them are avoidable on x86 CPUs if you follow Intel's and AMD's recommendations.) – David Schwartz Feb 12 '18 at 15:10
  • @DavidSchwartz: Have you had a look at the link? It doesn't link to a library, but to a blog post that shows a minimal example of a spinlock, which is almost identical to the answer I gave here: https://stackoverflow.com/questions/26583433/c11-implementation-of-spinlock-using-atomic/29195378#29195378 (the blog seems to precede my answer, but I'm sure I didn't know it back then). Depending on how try_lock is implemented, it should still be more efficient than what the OP posted here, but it doesn't have any provisions against branch misprediction or hyper-threading starvation. – MikeMB Feb 12 '18 at 16:00
  • @MikeMB You're correct. The spinlock in the blog post has all those mistakes. Shocking, IMO. These are not difficult problems to solve on most platforms. For example, on x86, there are simple workarounds. (I don't know any way to solve them in a platform-independent way, though. Synchronization primitives just *are* platform-specific, period.) These are all fake spinlocks that do not perform at all the way the real spinlocks on your platform do -- I hope. – David Schwartz Feb 15 '18 at 17:43
2

The main benefit of using a spinlock is that it is extremely cheap to acquire and release if the all-important precondition is true: there is little or no contention on the lock.

If you know with sufficient certainty that there will be no contention, a spinlock will greatly outperform a naive mutex implementation, which goes through library code performing validations you don't necessarily need and then makes a syscall. That means doing a context switch (costing several hundred cycles), giving up the thread's time slice, and having your thread rescheduled. Rescheduling may take an indefinite amount of time: even if the lock would be available almost immediately afterwards, under unfavorable conditions you may still have to wait tens of milliseconds before your thread runs again.
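
To make "extremely cheap" concrete, here is a hedged sketch of the uncontended fast path (names are illustrative): acquiring and releasing are each a single atomic instruction executed entirely in user space, with no library bookkeeping and no syscall.

#include <atomic>

std::atomic_flag flag = ATOMIC_FLAG_INIT;

void enter() {
    // Uncontended case: a single test-and-set that succeeds immediately.
    while (flag.test_and_set(std::memory_order_acquire)) { /* spin */ }
}

void leave() {
    // A single atomic store; again, no kernel involvement.
    flag.clear(std::memory_order_release);
}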

If, however, the precondition of no contention does not hold, a spinlock is usually vastly inferior: it makes no progress, yet it consumes CPU resources as if it were doing useful work. When blocking on a mutex, your thread does not consume CPU resources, so they can be used by a different thread to do work, or the CPU may throttle down and save power. Neither is possible with a spinlock, which keeps doing "active work" until it succeeds (or fails).
In the worst case, if the number of waiters is greater than the number of CPU cores, spinlocks can cause a huge, disproportionate performance impact, because the threads that are active and running are waiting on a condition that cannot become true while they are running (releasing the lock requires the holder, a different thread, to get CPU time!).

On the other hand, one should expect every modern no-suck implementation of std::mutex to already include a tiny spinlock before falling back to doing a syscall. But... while it is a reasonable assumption, this is not guaranteed.
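
As an illustration of that spin-then-block idea, here is a sketch layered on top of std::mutex itself; the spin count is an arbitrary assumption, and real adaptive mutexes do this directly on the lock word rather than by calling try_lock():

#include <mutex>

template <class Mutex>
void spin_then_block(Mutex& m, int spin_limit = 100) {
    // Try a bounded number of cheap, non-blocking attempts first...
    for (int i = 0; i < spin_limit; ++i)
        if (m.try_lock())
            return;          // acquired without blocking
    // ...then give up spinning and block; lock() may sleep in the kernel.
    m.lock();
}

Unlocking is unchanged: call m.unlock() whichever path acquired the lock.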

Another, non-technical reason for using a spinlock instead of std::mutex may be license terms. License terms are a poor rationale for a design decision, but they may nevertheless be very real.
For example, the present GCC implementation is based exclusively on pthreads, which implies that "anything MinGW" using anything from the standard threads library necessarily links with winpthreads (lacking alternatives). That means you are subject to the winpthreads license, which implies you must reproduce their copyright message. For some people, that's a dealbreaker.

Damon