A lock-free mutex is a contradiction.
You can build a lock out of lock-free building blocks, and in fact that's the normal thing to do, whether it's hand-written in asm or with std::atomic.

But the overall locking algorithm is by definition not lock-free (https://en.wikipedia.org/wiki/Non-blocking_algorithm). The entire point is to stop other threads from making forward progress while one thread is in the critical section, even if it unfortunately sleeps while it's there.
I mean that the implementation could be lock-free on the happy path (locking from the not-locked state). std::mutex::lock() is that way too: it doesn't block if it doesn't have to! It might need to make a system call like futex(FUTEX_WAIT_PRIVATE) if there's a thread waiting for the lock. But so would an implementation that used std::atomic's wait/notify functions.
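To make that concrete, here's a minimal sketch of such a lock built on C++20's std::atomic wait/notify. The class name and details are mine, purely illustrative: the uncontended lock() is a single lock-free atomic RMW, but the algorithm as a whole still blocks.

#include <atomic>

// Minimal sketch: lock() from the unlocked state is one atomic RMW
// (the lock-free happy path); a contended lock() blocks, so the overall
// algorithm is not lock-free.
class TinyMutex {                     // hypothetical, not std::mutex's design
    std::atomic<unsigned> state{0};   // 0 = unlocked, 1 = locked
public:
    void lock() {
        while (state.exchange(1, std::memory_order_acquire) != 0)
            state.wait(1, std::memory_order_relaxed);  // futex on Linux/glibc
    }
    void unlock() {
        state.store(0, std::memory_order_release);
        state.notify_one();   // note: a system call even with no waiters
    }
};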
Perhaps you haven't understood what "lock-free" means: it never blocks, regardless of what other threads are doing. That is all. It doesn't mean "faster in the simple/easy case". For a complex algorithm (e.g. a queue), it's often faster to just give up and block if the circular buffer is full, rather than adding overhead to the simple case to allow other threads to "assist" or cancel a stuck partial operation. (Lock-free Progress Guarantees in a circular buffer queue)
There is no inherent advantage to rolling your own std::mutex out of std::atomic. The compiler-generated machine code has to do approximately the same things either way, and I'd expect the fast path to be about the same. The only difference would be in the choice of what to do in the already-locked case. (But it may be tricky to design a way to avoid calling notify iff there were no waiters, something that actual std::mutex manages in the glibc / pthreads implementation on Linux; see the sketch below.)
(I'm assuming the overhead of the library function call is negligible compared to the cost of an atomic RMW to take the mutex. Having that inlined into your code is a pretty minor advantage.)
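One classic way to skip the notify when there were no waiters is the 3-state design from Ulrich Drepper's paper "Futexes Are Tricky", which glibc's low-level locks resemble. Here's a sketch of it using C++20 atomic wait/notify instead of raw futex; the name and details are mine, illustrative rather than glibc's actual code:

#include <atomic>

// 0 = unlocked, 1 = locked with no waiters, 2 = locked with possible waiters.
// unlock() only pays for a notify when a waiter might actually exist.
class WaiterTrackingMutex {           // hypothetical name
    std::atomic<unsigned> state{0};
public:
    void lock() {
        unsigned expected = 0;
        if (state.compare_exchange_strong(expected, 1, std::memory_order_acquire))
            return;   // uncontended fast path: one atomic RMW, no syscalls
        // Contended: advertise that a waiter exists, then sleep until unlocked.
        while (state.exchange(2, std::memory_order_acquire) != 0)
            state.wait(2, std::memory_order_relaxed);
    }
    void unlock() {
        if (state.exchange(0, std::memory_order_release) == 2)
            state.notify_one();   // only when someone may be sleeping
    }
};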
A mutex implementation can be tuned for certain use-cases, in terms of how long it spin-waits before sleeping (using an OS-assisted mechanism like futex to enable other threads to wake it when releasing the lock), and in the exponential backoff used for the spin-wait portion.
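For instance, a lock() tuned to spin with exponential backoff before sleeping might look like this sketch. The constants are arbitrary placeholders you'd tune per workload, and _mm_pause() is x86-specific:

#include <atomic>
#include <immintrin.h>   // _mm_pause(): x86 spin-loop hint

std::atomic<unsigned> lock_word{0};   // 0 = unlocked, 1 = locked

void lock_spin_then_sleep() {
    for (unsigned delay = 1; delay <= 64; delay *= 2) {  // tunable bounds
        if (lock_word.exchange(1, std::memory_order_acquire) == 0)
            return;                    // got the lock while still spinning
        for (unsigned i = 0; i < delay; ++i)
            _mm_pause();               // exponential backoff between attempts
    }
    // Still contended: fall back to OS-assisted sleep (futex on Linux/glibc).
    while (lock_word.exchange(1, std::memory_order_acquire) != 0)
        lock_word.wait(1, std::memory_order_relaxed);
}

void unlock_spin_then_sleep() {
    lock_word.store(0, std::memory_order_release);
    lock_word.notify_one();
}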
If std::mutex doesn't perform well for your application on the hardware you care about, it's worth considering an alternative. Although IDK exactly how you'd go about measuring whether it worked well or not; perhaps you could figure out whether it was deciding to sleep when it would have been better to keep spinning, or vice versa.
And yes, you could consider rolling your own with std::atomic now that there's a portable mechanism to hopefully expose a way to fall back to OS-assisted sleep/wake mechanisms like futex. You'd still want to manually use system-specific things like x86 _mm_pause() inside a spin-wait loop, though, since I don't think C++ has anything equivalent to Rust's std::hint::spin_loop() that an implementation can use to expose things like the x86 pause instruction, intended for use in the body of a spin-loop. (See Locks around memory manipulation via inline assembly re: such considerations, and re: spinning read-only instead of spamming atomic RMW attempts. And for a look at the necessary parts of a spinlock in x86 assembly language, which are the same whether you get a C++ compiler to generate that machine code for you or not.)
See also https://rigtorp.se/spinlock/ re: implementing a mutex in C++ with std::atomic.
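Roughly the shape of the spinlock described there, a test-and-test-and-set loop that spins read-only with pause between RMW attempts (names and details mine):

#include <atomic>
#include <immintrin.h>   // _mm_pause(): x86 spin-loop hint

struct Spinlock {   // sketch only: no sleep fallback, so don't oversubscribe cores
    std::atomic<bool> locked{false};

    void lock() {
        for (;;) {
            // Optimistically attempt the atomic RMW first.
            if (!locked.exchange(true, std::memory_order_acquire))
                return;
            // Spin read-only: plain loads let the cache line stay Shared,
            // instead of every core fighting for exclusive ownership with RMWs.
            while (locked.load(std::memory_order_relaxed))
                _mm_pause();
        }
    }
    void unlock() { locked.store(false, std::memory_order_release); }
};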
Linux / libstdc++ behaviour of notify/wait
I tested what system calls std::atomic::wait makes when it has to wait for a long time, on Arch Linux (glibc 2.33).
- std::mutex lock no-contention fast path, and unlock with no waiters: zero system calls, purely user-space atomic operations. Notably it's able to detect that there are no waiters when unlocking, so it doesn't make a FUTEX_WAKE system call (which would otherwise cost maybe a hundred times more than taking and releasing an uncontended mutex that was still hot in this core's L1d cache).
- std::mutex lock() on an already-locked mutex: only a futex(0x55be9461b0c0, FUTEX_WAIT_PRIVATE, 2, NULL) system call. Possibly some spinning in user-space before that; I didn't single-step into it with GDB, but if so, probably with a pause instruction.
- std::mutex unlock() with a waiter: uses futex(0x55ef4af060c0, FUTEX_WAKE_PRIVATE, 1) = 1. (After an atomic RMW, IIRC; not sure why it doesn't just use a release store.)
- std::atomic::notify_one: always a futex(address, FUTEX_WAKE, 1), even if there are no waiters, so it's up to you to avoid it if there are no waiters when you unlock a lock (e.g. with the 3-state trick sketched earlier).
- std::atomic::wait: spins a few times in user-space, including 4x sched_yield() calls, before a futex(addr, FUTEX_WAIT, old_val, NULL).
Note the use of FUTEX_WAIT instead of FUTEX_WAIT_PRIVATE by the wait/notify functions: these should work across processes on shared memory. The futex(2) man page says the _PRIVATE versions (only for threads of a single process) allow some additional optimizations.
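If you wanted the _PRIVATE behaviour for a process-private lock of your own, you'd have to call futex directly; glibc provides no wrapper. A minimal Linux-only sketch, assuming std::atomic<unsigned> is lock-free and 4 bytes (true on mainstream Linux ABIs):

#include <atomic>
#include <linux/futex.h>   // FUTEX_WAIT_PRIVATE, FUTEX_WAKE_PRIVATE
#include <sys/syscall.h>   // SYS_futex
#include <unistd.h>        // syscall()

// The kernel re-checks *uaddr == expected atomically before sleeping,
// so there's no lost-wakeup race with an unlocker that changes the value.
static void futex_wait_private(std::atomic<unsigned>* uaddr, unsigned expected) {
    syscall(SYS_futex, uaddr, FUTEX_WAIT_PRIVATE, expected, nullptr);
}

static void futex_wake_one_private(std::atomic<unsigned>* uaddr) {
    syscall(SYS_futex, uaddr, FUTEX_WAKE_PRIVATE, 1);
}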
I don't know about other systems, although I've heard some kinds of locks on Windows / MSVC have to suck (always a syscall even on the fast path) for backwards-compat with some ABI choices, or something. Like perhaps std::lock_guard is slow on MSVC, but std::unique_lock isn't?
Test code:
#include <atomic>
#include <thread>
#include <unistd.h> // for quick & dirty sleep and usleep. TODO: port to non-POSIX.
#include <mutex>

#if 1 // test std::atomic
//std::atomic_unsigned_lock_free gvar;
//std::atomic_uint_fast_wait_t gvar;
std::atomic<unsigned> gvar;

void waiter(){
    volatile unsigned sink;
    while (1) {
        sink = gvar;
        gvar.wait(sink);  // block until it's not the old value.
        // on Linux/glibc, 4x sched_yield(), then futex(0x562c3c4881c0, FUTEX_WAIT, 46, NULL) or whatever the current counter value is
    }
}

void notifier(){
    while(1){
        sleep(1);
        gvar.store(gvar.load(std::memory_order_relaxed)+1, std::memory_order_relaxed);
        gvar.notify_one();
    }
}
#else
std::mutex mut;

void waiter(){
    unsigned sink = 0;
    while (1) {
        mut.lock();   // Linux/glibc2.33 - just a futex system call, maybe some spinning in user-space first. But no sched_yield
        mut.unlock();
        sink++;
        usleep(sink); // give the other thread plenty of time to take the lock, so we don't get it twice.
    }
}

void notifier(){
    while(1){
        mut.lock();
        sleep(1);
        mut.unlock();
    }
}
#endif

int main(){
    std::thread t (waiter);  // comment this to test the no-contention case
    notifier();
    // loops forever: kill it with control-C
}
Compile with g++ -Og -std=gnu++20 notifywait.cpp -pthread, and run with strace -f ./a.out to see the system calls. (A couple or a few per second, since I used a nice long sleep.)
If there's any spin-waiting in user-space, it's negligible compared to the 1-second sleep interval: the whole process uses about a millisecond of CPU time, including startup, to run for 19 iterations. (perf stat ./a.out)
Usually your time would be better spent trying to reduce the amount of locking involved, or the amount of contention, rather than trying to optimize the locks themselves. Locking is an extremely important thing, and lots of engineering has gone into tuning it for most use-cases.
If you're rolling your own lock, you probably want to get your hands dirty with system-specific stuff, because it's all a matter of tuning choices. Different systems are unlikely to have made the same tuning choices for std::mutex and atomic wait as Linux/glibc. Unless std::atomic::wait's retry strategy on the only system you care about happens to be perfect for your use-case.
It doesn't make sense to roll your own mutex without first investigating exactly what std::mutex does on your system, e.g. single-stepping the asm for the already-locked case to see what retries it makes. Then you'll have a better idea whether you can do any better.
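(On Linux/glibc, one way to do that: libstdc++'s std::mutex::lock() ends up in pthread_mutex_lock, so you can break there in GDB and stepi through the contended case while another thread holds the lock, watching any retry loop before the futex system call.)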