Essentially, my question is:
What does a "good" implementation of a spinlock look like in C++ that works on the "usual" CPU/OS/compiler combinations (x86 & ARM, Windows & Linux, MSVC & clang & g++ (maybe also icc))?
Explanation:
As I wrote in the answer to a different question, it is fairly easy to write a working spinlock in c++11. However, as pointed out (in the comments as well as in e.g. spinlock-vs-stdmutextry-lock), such an implementation comes with some performance problems in case of congestion, which imho can only be solved by using platform specific instructions (intrinsics / os primitives / assembly?).
I'm not looking for a super optimized version (I expect that would only make sense if you have very precise knowledge about the exact platform and workload and need every last bit of efficiency) but something that lives around the mythical 20/80 tradeoff point i.e. I want to avoid the most important pitfalls in most cases while still keeping the solution as simple and understandable as possible.
In general, I'd expect the result to look something like this:
    #include <atomic>
    #ifdef _MSC_VER
    #include <Windows.h>
    // No trailing semicolon here: the call site already writes "YIELD_CPU;".
    #define YIELD_CPU YieldProcessor()
    #elif defined(...)
    #define YIELD_CPU ...
    ...
    #endif
    class SpinLock {
        std::atomic_flag locked = ATOMIC_FLAG_INIT;
    public:
        void lock() {
            // Spin until the flag was previously clear, i.e. we acquired the lock.
            while (locked.test_and_set(std::memory_order_acquire)) {
                YIELD_CPU;
            }
        }
        void unlock() {
            locked.clear(std::memory_order_release);
        }
    };
But I don't know:
whether a YIELD_CPU macro inside the loop is all that's needed, or whether there are other problematic aspects (e.g. can/should we indicate that we expect the test_and_set to succeed most of the time);
what the appropriate mapping for YIELD_CPU on the different CPU/OS/compiler combinations is (and, if possible, I'd like to avoid dragging in a heavyweight header like Windows.h).
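To make the two points above more concrete, here is a rough sketch of the direction I have in mind. It is a guess, not something I have verified on all the listed platforms. The names SPIN_PAUSE, TtasSpinLock and hammer_counter are made up for illustration. The sketch assumes that _mm_pause from <intrin.h> is usable with MSVC on x86, that __builtin_ia32_pause is available with g++/clang on x86, and that the ARM target accepts the "yield" instruction (older ARM architectures may not). Everything else falls back to std::this_thread::yield(), which is an OS-level yield and not the same thing as a CPU pause hint. It also spins on a plain load between exchange attempts (test-and-test-and-set), which is one candidate for the "other problematic aspects" I am asking about.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical mapping of a "pause" hint; each branch is an assumption.
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
  #include <intrin.h>               // lighter than Windows.h
  #define SPIN_PAUSE() _mm_pause()
#elif defined(__i386__) || defined(__x86_64__)
  #define SPIN_PAUSE() __builtin_ia32_pause()
#elif defined(__GNUC__) && (defined(__arm__) || defined(__aarch64__))
  #define SPIN_PAUSE() __asm__ __volatile__("yield" ::: "memory")
#else
  // Fallback: asks the OS scheduler to run another thread, not a CPU hint.
  #define SPIN_PAUSE() std::this_thread::yield()
#endif

class TtasSpinLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        for (;;) {
            // Try to take the lock; exchange returns the previous value.
            if (!locked.exchange(true, std::memory_order_acquire)) return;
            // Lock is held: wait with read-only loads until it looks free.
            while (locked.load(std::memory_order_relaxed)) SPIN_PAUSE();
        }
    }
    bool try_lock() {
        // Cheap load first so a held lock is not written to needlessly.
        return !locked.load(std::memory_order_relaxed) &&
               !locked.exchange(true, std::memory_order_acquire);
    }
    void unlock() { locked.store(false, std::memory_order_release); }
};

// Illustrative helper: several threads increment a counter under the lock.
long hammer_counter(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    TtasSpinLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                std::lock_guard<TtasSpinLock> guard(lock);
                ++counter;
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

Since the class provides lock(), unlock() and try_lock(), it can be used with std::lock_guard and std::unique_lock. Whether these particular mappings are the right ones is exactly what I'd like answers to confirm or correct.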
Note: I'm also interested in answers that only cover a subset of the mentioned platforms, but I might not mark them as the accepted answer, and/or I might merge them into a separate community answer.