How to properly implement a cross-platform spinlock in c++

Question

Essentially, my question is:

What does a "good" implementation of a spinlock look like in C++ that works on the "usual" CPU/OS/compiler combinations (x86 & ARM, Windows & Linux, msvc & clang & g++, maybe also icc)?

Explanation:
As I wrote in the answer to a different question, it is fairly easy to write a working spinlock in c++11. However, as pointed out (in the comments as well as in e.g. spinlock-vs-stdmutextry-lock), such an implementation comes with some performance problems in case of congestion, which imho can only be solved by using platform specific instructions (intrinsics / os primitives / assembly?).

I'm not looking for a super optimized version (I expect that would only make sense if you have very precise knowledge about the exact platform and workload and need every last bit of efficiency) but something that lives around the mythical 20/80 tradeoff point i.e. I want to avoid the most important pitfalls in most cases while still keeping the solution as simple and understandable as possible.

In general, I'd expect the result to look something like this:

#include <atomic>

#ifdef _MSC_VER
    #include <Windows.h>
    #define YIELD_CPU YieldProcessor()
#elif defined(...)
    #define YIELD_CPU ...
...
#endif

class SpinLock {
    std::atomic_flag locked = ATOMIC_FLAG_INIT;
public:
    void lock() {
        while (locked.test_and_set(std::memory_order_acquire)) {
            YIELD_CPU;
        }
    }
    void unlock() {
        locked.clear(std::memory_order_release);
    }
};

But I don't know:

- if a YIELD_CPU macro inside the loop is all that's needed, or if there are any other problematic aspects (e.g. can/should we indicate that we expect the test_and_set to succeed most of the time?)
- what the appropriate mapping for YIELD_CPU is on the different CPU/OS/compiler combinations (and, if possible, I'd like to avoid dragging in a heavyweight header like Windows.h)
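
To make the two open points concrete, here is a hedged sketch of one shape such a lock could take, not a definitive answer. It assumes a "test and test-and-set" loop (spin on a plain load, only retry the atomic exchange once the lock looks free) and a CPU-level pause hint. The names cpu_relax, TtasSpinLock and count_with_spinlock are hypothetical and not part of the code above. The mappings used are _mm_pause() and __yield() from MSVC's intrin.h, __builtin_ia32_pause() on GCC/clang for x86, and the yield instruction via inline assembly on ARM (assuming ARMv6K or later). Whether this actually helps under contention would still have to be measured per platform.

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: a CPU-level "relax" hint that stays in user space
// (unlike std::this_thread::yield, it does not ask the OS to reschedule).
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    #include <intrin.h>  // lighter than Windows.h
    inline void cpu_relax() { _mm_pause(); }
#elif defined(_MSC_VER) && (defined(_M_ARM) || defined(_M_ARM64))
    #include <intrin.h>
    inline void cpu_relax() { __yield(); }
#elif defined(__i386__) || defined(__x86_64__)
    inline void cpu_relax() { __builtin_ia32_pause(); }
#elif defined(__arm__) || defined(__aarch64__)
    // Assumption: target is ARMv6K or later, where "yield" is a valid hint.
    inline void cpu_relax() { __asm__ __volatile__("yield" ::: "memory"); }
#else
    inline void cpu_relax() {}  // assumption: no hint known, plain busy-wait
#endif

class TtasSpinLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        for (;;) {
            // exchange returns the previous value: false means we took the lock.
            if (!locked.exchange(true, std::memory_order_acquire)) {
                return;
            }
            // Contended: wait on a relaxed load so waiters only read the
            // cache line instead of repeatedly writing to it.
            while (locked.load(std::memory_order_relaxed)) {
                cpu_relax();
            }
        }
    }
    bool try_lock() {
        // Cheap load first, then a single exchange attempt.
        return !locked.load(std::memory_order_relaxed) &&
               !locked.exchange(true, std::memory_order_acquire);
    }
    void unlock() {
        locked.store(false, std::memory_order_release);
    }
};

// Usage sketch: several threads increment a shared counter under the lock.
inline long count_with_spinlock(int num_threads, int iterations) {
    TtasSpinLock lock;
    long counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                std::lock_guard<TtasSpinLock> guard(lock);
                ++counter;
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return counter;
}
```

Because the class provides lock(), try_lock() and unlock(), it works with std::lock_guard and std::unique_lock like std::mutex does. std::atomic<bool> is used instead of std::atomic_flag here only because atomic_flag has no plain load before C++20.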

Note: I'm also interested in answers that only cover a subset of the mentioned platforms, but might not mark them as the accepted answer and/or merge them into a separate community answer.

You may be interested in http://en.cppreference.com/w/cpp/thread/yield — François Andrieux, Feb 12 '18 at 16:17
You might be interested to know that most built-in locks will spin for a little while when you lock them... — UKMonkey, Feb 12 '18 at 16:23
The unlock path might be more expensive, but yes - the 80:20 tradeoff is to just use `std::mutex` and replace it only where and when it actually proves to be a problem. — Useless, Feb 12 '18 at 16:34
@FrançoisAndrieux: Thanks, but no. That function will reschedule your thread, which is 99% of the time not what you want when you use a spinlock. — MikeMB, Feb 12 '18 at 16:49
@UKMonkey: The problem is that last time I checked, std::mutex has a significant overhead in the non-contested case which I try to avoid. — MikeMB, Feb 12 '18 at 16:51
"cross platform" and "spin lock" do not sound like phrases that belong together. — Solomon Slow, Feb 12 '18 at 16:59
Which platforms did you check `std::mutex` on? All of the above? Which operations were significantly slower? — Useless, Feb 12 '18 at 17:09
Anyway, if you want to write a cross-platform spinlock like this, I'd look up the already-available optimal spinlocks for each platform first, and _then_ factor out the commonality. — Useless, Feb 12 '18 at 17:11
The given implementation does not spin. It tests once and then yields, which is an OS call. — Jive Dadson, Feb 12 '18 at 17:41
@JiveDadson: Please look up the definition of [YieldProcessor](https://msdn.microsoft.com/de-de/library/windows/desktop/ms687419(v=vs.85).aspx) — MikeMB, Feb 12 '18 at 17:50
EDIT: I deleted some of my comments, where I got a little emotional. Sorry for that. — MikeMB, Feb 12 '18 at 18:18
@Useless: I checked on x64/x86 Windows with msvc (2017) and Ubuntu with g++-5.4 and clang-6.0, as well as on an embedded ARM Cortex-A9 platform running Linux, compiled with a g++ 4.9 cross-compiler toolchain. On both platforms, my naive spin_lock implementation (without any yield statements) worked significantly better than std::mutex (2x - 10x in microbenchmarks). What specific operations are you referring to? Do you mean the operations I protect with that lock? — MikeMB, Feb 12 '18 at 18:19
