
Since C++20, std::atomic has wait and notify operations, and with is_always_lock_free we can verify that the atomic itself is lock-free. With these building blocks, building a lock-free mutex is not so difficult: in the trivial case, locking is a compare-exchange operation, or a wait if the mutex is already locked. The big question here is whether it is worth it. If I can create such an implementation, most probably the STL version is much better and faster. However, I still remember how surprised I was when I saw how QMutex outperformed std::mutex (QMutex vs std::mutex in 2016). So what do you think: should I experiment with such an implementation, or is the current implementation of std::mutex mature enough to be optimized far beyond these tricks?

UPDATE: My wording wasn't the best. I mean that the implementation could be lock-free on the happy path (locking from the not-locked state). Of course we should block and be re-scheduled if we have to wait to acquire the lock. Most probably atomic::wait is not implemented as a simple spin loop on most platforms (let's ignore the corner cases for now), so basically it achieves the very same thing mutex::lock does. So if I implement such a class, it would do exactly what std::mutex does (again, on most of the popular platforms). That means the STL could use the same tricks in its mutex implementation on the platforms that support these tricks. Like this spinlock, but I would use atomic wait instead of spinning. Should I trust my STL implementation to have done so?
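To make it concrete, here is roughly what I have in mind, as a minimal untested sketch (class and member names are mine):

#include <atomic>

// 0 = unlocked, 1 = locked
class MyMutex {
    std::atomic<unsigned> state{0};
    static_assert(std::atomic<unsigned>::is_always_lock_free);
public:
    void lock() {
        unsigned expected = 0;
        // Happy path: one compare-exchange from unlocked to locked.
        while (!state.compare_exchange_weak(expected, 1,
                                            std::memory_order_acquire,
                                            std::memory_order_relaxed)) {
            state.wait(1);    // block until the value is no longer 1
            expected = 0;     // the failed CAS overwrote expected; reset it
        }
    }
    void unlock() {
        state.store(0, std::memory_order_release);
        state.notify_one();   // wake one waiter, even if there is none
    }
};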

Broothy
  • Another factor to consider is that the size of `std::mutex` is often large (80 bytes on MSVC x64, 40 bytes on godbolt.org gcc/clang x64), while I believe that you can build a simple mutex out of a `std::atomic_uchar`, which occupies only one byte. – zwhconst Mar 13 '23 at 05:08

2 Answers


A lock-free mutex is a contradiction.

You can build a lock out of lock-free building-blocks, and in fact that's the normal thing to do whether it's hand-written in asm or with std::atomic.

But the overall locking algorithm is by definition not lock-free. (https://en.wikipedia.org/wiki/Non-blocking_algorithm). The entire point is to stop other threads from making forward progress while one thread is in the critical section, even if it unfortunately sleeps while it's there.

I mean that the implementation could be lock free on the happy path (locking from not locked state)

std::mutex::lock() is that way too: it doesn't block if it doesn't have to! It might need to make a system call like futex(FUTEX_WAIT_PRIVATE) if there's a thread waiting for the lock. But so would an implementation that used std::atomic's wait/notify.

Perhaps you haven't understood what "lock-free" means: it never blocks, regardless of what other threads are doing. That is all. It doesn't mean "faster in the simple/easy case". For a complex algorithm (e.g. a queue), it's often faster to just give up and block if the circular buffer is full, rather than adding overhead to the simple case to allow other threads to "assist" or cancel a stuck partial operation. (Lock-free Progress Guarantees in a circular buffer queue)

There is no inherent advantage to rolling your own std::mutex out of std::atomic. The compiler-generated machine code has to do approximately the same things either way, and I'd expect the fast path to be about the same. The only difference would be in choice of what to do in the already-locked case. (But it may be tricky to design a way to avoid calling notify when there were no waiters, something that the actual std::mutex manages in the glibc / pthreads implementation on Linux; a sketch of that trick appears below.)

(I'm assuming the overhead of the library function call is negligible compared to the cost of an atomic RMW to take the mutex. Having that inline into your code is a pretty minor advantage.)


A mutex implementation can be tuned for certain use-cases, in terms of how long it spin-waits before sleeping (using an OS-assisted mechanism like futex to enable other threads to wake it when releasing the lock), and in exponential backoff for the spin-wait portion.

If std::mutex doesn't perform well for your application on the hardware you care about, it's worth considering an alternative, although IDK exactly how you'd go about measuring whether it worked well or not. Perhaps you could figure out whether it was deciding to sleep when it would have been better to keep spinning, or vice versa.

And yes, you could consider rolling your own with std::atomic now that there's a portable mechanism that hopefully exposes a way to fall back to OS-assisted sleep/wake mechanisms like futex. You'd still want to manually use system-specific things like x86 _mm_pause() inside a spin-wait loop, though, since I don't think C++ has anything equivalent to Rust's std::hint::spin_loop(), which an implementation can use to expose things like the x86 pause instruction, intended for use in the body of a spin loop. (See Locks around memory manipulation via inline assembly re: such considerations, and re: spinning read-only instead of spamming atomic-RMW attempts. It's also a look at the necessary parts of a spinlock in x86 assembly language, which are the same whether you get a C++ compiler to generate that machine code for you or not.)
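For illustration, the spin-wait portion with exponential backoff might look like this sketch (assuming x86; the round and pause limits are arbitrary tuning parameters I made up, not measured values):

#include <atomic>
#include <immintrin.h>   // _mm_pause(): x86-specific spin-loop hint

// Spin read-only with exponential backoff before the caller falls back
// to an OS-assisted sleep such as atomic::wait.
inline bool spin_until_maybe_free(const std::atomic<unsigned>& state) {
    unsigned pauses = 1;
    for (int round = 0; round < 8; ++round) {
        if (state.load(std::memory_order_relaxed) == 0)
            return true;          // looks unlocked: the caller retries its CAS
        for (unsigned i = 0; i < pauses; ++i)
            _mm_pause();          // be polite to the other hyperthread
        pauses *= 2;              // exponential backoff
    }
    return false;                 // still locked: the caller should sleep
}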

See also https://rigtorp.se/spinlock/ re: implementing a mutex in C++ with std::atomic.
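To sketch the notify-avoidance trick mentioned above: the classic design (described in Ulrich Drepper's "Futexes Are Tricky" paper) uses three states, so unlock() knows whether a wake-up might be needed. A minimal, untested translation onto C++20 atomic wait/notify (naming is mine):

#include <atomic>

// States: 0 = unlocked, 1 = locked with no waiters, 2 = locked, maybe waiters.
class SleepingMutex {
    std::atomic<unsigned> state{0};
public:
    void lock() {
        unsigned c = 0;
        if (state.compare_exchange_strong(c, 1, std::memory_order_acquire,
                                          std::memory_order_relaxed))
            return;                   // fast path: it was unlocked
        if (c != 2)                   // mark the lock as contended
            c = state.exchange(2, std::memory_order_acquire);
        while (c != 0) {              // c == 0 means we took the lock
            state.wait(2);            // sleep while the value is still 2
            c = state.exchange(2, std::memory_order_acquire);
        }
    }
    void unlock() {
        if (state.exchange(0, std::memory_order_release) == 2)
            state.notify_one();       // someone may be sleeping: wake one
    }
};

The cost of this simple scheme is that once the lock is contended, the state stays 2 until unlock, so the last waiter's unlock still makes one wake-up call with nobody left to wake.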


Linux / libstdc++ behaviour of notify/wait

I tested what system calls std::mutex and std::atomic's wait/notify make when a thread has to wait for a long time, on Arch Linux (glibc 2.33).

  • std::mutex lock no-contention fast path, and unlock with no waiters: zero system calls, purely user-space atomic operations. Notably, it is able to detect that there are no waiters when unlocking, so it doesn't make a FUTEX_WAKE system call (which otherwise would cost maybe a hundred times more than taking and releasing an uncontended mutex that was still hot in this core's L1d cache).

  • std::mutex lock() on already locked: only a futex(0x55be9461b0c0, FUTEX_WAIT_PRIVATE, 2, NULL) system call. Possibly some spinning in user-space before that; I didn't single-step into it with GDB, but if so probably with a pause instruction.

  • std::mutex unlock() with a waiter: uses futex(0x55ef4af060c0, FUTEX_WAKE_PRIVATE, 1) = 1. (After an atomic RMW, IIRC; not sure why it doesn't just use a release store.)

  • std::atomic::notify_one: always makes a futex(address, FUTEX_WAKE, 1) system call, even if there are no waiters, so it's up to you to avoid it if there are no waiters when you unlock a lock.

  • std::atomic::wait: spins a few times in user-space, including 4x sched_yield() calls, before a futex(addr, FUTEX_WAIT, old_val, NULL).

Note the use of FUTEX_WAIT instead of FUTEX_WAIT_PRIVATE by the wait/notify functions: these should work across processes on shared memory. The futex(2) man page says the _PRIVATE versions (only for threads of a single process) allow some additional optimizations.

I don't know about other systems, although I've heard some kinds of locks on Windows / MSVC have to suck (always a syscall even on the fast path) for backwards-compat with some ABI choices, or something. Like perhaps std::lock_guard is slow on MSVC, but std::unique_lock isn't?

Test code:

#include <atomic>
#include <thread>
#include <unistd.h>   // for quick & dirty sleep and usleep.  TODO: port to non-POSIX.
#include <mutex>

#if 1        // test std::atomic
//std::atomic_unsigned_lock_free gvar;
//std::atomic_uint_fast_wait_t gvar;
std::atomic<unsigned> gvar;

void waiter(){
    volatile unsigned sink;
    while (1) {
        sink = gvar;
        gvar.wait(sink);   // block until it's not the old value.
        // on Linux/glibc, 4x sched_yield(), then futex(0x562c3c4881c0, FUTEX_WAIT, 46, NULL )  or whatever the current counter value is
    }
}

void notifier(){
    while(1){
        sleep(1);
        gvar.store(gvar.load(std::memory_order_relaxed)+1, std::memory_order_relaxed);
        gvar.notify_one();
    }
}
#else
std::mutex mut;

void waiter(){
    unsigned sink = 0;
    while (1) {
        mut.lock();     // Linux/glibc2.33 - just a futex system call, maybe some spinning in user-space first. But no sched_yield
        mut.unlock();
        sink++;
        usleep(sink);  // give the other thread plenty of time to take the lock, so we don't get it twice.
    }
}

void notifier(){
    while(1){
        mut.lock();
        sleep(1);
        mut.unlock();
    }
}
#endif

int main(){
    std::thread t (waiter);   // comment this to test the no-contention case
    notifier();
    // loops forever: kill it with control-C
}

Compile with g++ -Og -std=gnu++20 notifywait.cpp -pthread, run with strace -f ./a.out to see the system calls. (A couple or a few per second, since I used a nice long sleep.)

If there's any spin-waiting in user-space, it's negligible compared to the 1 second sleep interval, so it uses about a millisecond of CPU time including startup, to run for 19 iterations. (perf stat ./a.out)


Usually your time would be better spent trying to reduce the amount of locking involved, or the amount of contention, rather than trying to optimize the locks themselves. Locking is an extremely important thing, and lots of engineering has gone into tuning it for most use-cases.

If you're rolling your own lock, you probably want to get your hands dirty with system-specific stuff, because it's all a matter of tuning choices. Different systems are unlikely to have made the same tuning choices for std::mutex and wait as Linux/glibc. Unless std::atomic::wait's retry strategy on the only system you care about happens to be perfect for your use-case.

It doesn't make sense to roll your own mutex without first investigating exactly what std::mutex does on your system, e.g. single-step the asm for the already-locked case to see what retries it makes. Then you'll have a better idea whether you can do any better.

Peter Cordes
  • Thank you for your answer, please check my update. As I see we have (at least) two different implementations to block a thread, one is with std::mutex and one is with atomic::wait. Do you think it is worth it to spend some time and try to create my own mutex implementation based on atomic::wait, or most probably the STL implementation already utilized these tricks in std::mutex? – Broothy Jul 07 '22 at 06:17
  • 1
    @Broothy: `std::mutex` uses the same `futex` syscall as a fallback, and some tuning choice of how long to spin for before giving the CPU back to the OS, and the details. It might be a different tuning choice than for `std::atomic::notify`, which on Linux with glibc 2.33 makes 4x `sched_yield()` calls before `futex(FUTEX_WAIT)`. – Peter Cordes Jul 07 '22 at 06:28
  • 1
    @Broothy: Just tested `std::mutex`, too (https://godbolt.org/z/Gdo9xs6MW): it doesn't do any `sched_yield()` calls before `futex` if the lock is already taken. If you're rolling your own lock, you probably want to get your hands dirty with system-specific stuff, because it's all a matter of *tuning* choices. Different systems are unlikely to have made the same tuning choices for `std::mutex` and `wait` as Linux/glibc. Unless `std::wait`'s retry strategy on the only system you care about happens to be perfect for your use-case. – Peter Cordes Jul 07 '22 at 06:37
  • 1
    But usually your time would be better spent trying to reduce the amount of locking involved, or the amount of contention, rather than trying to optimize the locks themselves. Locking is an extremely important thing, and lots of engineering has gone into making it good for most use-cases. (At least on Linux; I've heard some horror stories about MSVC's default implementation of lock functions that have to suck for backwards-compat with some ABI choices, or something. Like not retrying at all in user-space.) – Peter Cordes Jul 07 '22 at 06:40
  • 1
    @PeterCordes Great, detailed answer. There are still reasons to roll one's own lock implementation, though, particularly for spinlocks. For example, maybe you want to use hardware-lock elision (not that you can do that in portable C++), maybe you need (or don't need) fairness, maybe you care about (or don't care about) memory traffic on your interconnect. Most programmers shouldn't need to do this, though. – user3188445 Jul 07 '22 at 07:39
  • @user3188445: Is there still any hardware with HLE that hasn't been disabled by microcode updates? RTM (manual `xbegin`/`xend`) is still enableable on current CPUs I think, but last I read, HLE was permanently disabled by microcode on all Intel CPUs (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c2955f270a84762343000f103e0640d29c7a96f3). I don't follow other ISAs in enough detail to know if POWER or AArch64 have HLE. But yeah, if you only care about CPUs that have RTM, you could possibly convert your lock acquire/reload to transaction begin/end for limited cases – Peter Cordes Jul 07 '22 at 07:55

You may want to distinguish between a mutex (which is generally a sleeping lock that interacts with the scheduler) and a spinlock (which does not put the current thread to sleep, and makes sense only when a thread on a different CPU is likely to be holding the lock).

Using C++20 atomics, you can definitely implement a spinlock, but this won't be directly comparable to std::mutex, which puts the current thread to sleep. Mutexes and spinlocks are useful in different situations. When it acquires the lock quickly, a spinlock is probably going to be faster; after all, the mutex implementation likely contains a spinlock. It's also the only kind of lock you can acquire in an interrupt handler (though this is less relevant to user-level code). But if you hold the spinlock for a long time and there is contention, you will waste a huge amount of CPU time.
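For contrast with the sleeping mutex, here is a minimal C++20 test-and-test-and-set spinlock, as a sketch (a production version would add a pause hint and backoff, as discussed in the other answer):

#include <atomic>

// Never sleeps: burns CPU until the lock becomes available.
class Spinlock {
    std::atomic_flag flag;   // default-initialized to clear since C++20
public:
    void lock() {
        while (flag.test_and_set(std::memory_order_acquire)) {
            // Spin read-only (atomic_flag::test is C++20) instead of
            // hammering the cache line with atomic RMW attempts.
            while (flag.test(std::memory_order_relaxed)) {}
        }
    }
    void unlock() {
        flag.clear(std::memory_order_release);
    }
};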

user3188445
  • What prevents `std::atomic::wait` from "interacting with a scheduler"? If that's the distinction you're drawing. – Jeff Garrett Jul 06 '22 at 22:29
  • You can *try* to take a spinlock in an interrupt handler, but if it's already taken you can't always safely wait for it. On a uniprocessor system, the code that holds the lock will never regain control to finish its critical section and unlock if this interrupt handler interrupted it. (On an SMP system, user-space could eventually get scheduled to another core. Or with a preempt-able kernel, the interrupted kernel code could eventually run somewhere else. But only if there are multiple cores, so an interrupt handler that spin-waits for a lock is a deadlock waiting to happen on a 1-core VM) – Peter Cordes Jul 07 '22 at 00:18
  • @PeterCordes Spinlocks are the only kind of synchronization available in interrupt handlers, because you can't sleep in an interrupt handler. You do have to disable interrupts before acquiring such a spinlock, however, as otherwise as you point out you could end up trying to acquire the spinlock in an interrupt handler running on the same CPU as the thread that was just interrupted, and you will get deadlock. – user3188445 Jul 07 '22 at 03:32
  • @user3188445: Ok yeah, if we're talking about a lock that's only ever taken by other kernel code with interrupts disabled, then yeah the spinlock gives mutual exclusion from other cores on an SMP system. But disabling interrupts alone is enough to give mutual exclusion wrt. other code running on this core. On a uniprocessor system, you don't need the spinlock, just disabling interrupts around a critical section. – Peter Cordes Jul 07 '22 at 03:49
  • The user-space equivalent of this is a signal handler, but it's much less convenient for non-signal-handler code to disable signal handlers before taking a lock. (You'd be equally screwed if a signal handler that tried to take a lock ran in the context of a thread that had just taken a lock.) – Peter Cordes Jul 07 '22 at 03:51
  • @user3188445 Are you completely sure that atomic wait is implemented with a simple spin lock? As far as I know it can fall back to spinning, but in the ideal case it will use futexes or other sophisticated techniques. Of course there are corner cases, but I would like to target popular platforms like x86_64 Win/Linux and arm64 Linux, for example. On those platforms I think atomic wait will have an implementation that makes it directly comparable with std::mutex. – Broothy Jul 07 '22 at 05:39
  • @Broothy: A good implementation of `wait` and `notify` *should* be using something like Linux `futex`; I assume the point of `wait` and `notify` is to expose that functionality via a portable API. You can check if you want; write a program that sleeps for a whole second or so before calling `notify`, and see if `wait`ing threads are burning CPU time or sleeping. Also you can use `strace -f ./a.out` on Linux to see system calls in each thread. Of course a spin-wait is a *possible* implementation. – Peter Cordes Jul 07 '22 at 05:59
  • @PeterCordes Unfortunately, I can't tell from the description whether atomic::wait is intended to expose monitor/mwait type primitives or futexes. So maybe if you play with it more, you can actually report what it does by disassembling your code. Given that it's intended for user-level code, it may be that it calls into the futexes. I find it strange that it is then grouped with atomics rather than the thread primitives, but I still think it (the futex theory) makes the most sense compared to any other theories I can think of right now. – user3188445 Jul 07 '22 at 06:03
  • @user3188445 Let's assume that wait/notify has a good implementation :) In this case std::mutex could use that implementation, right? This is why I ask if I should trust my STL implementation and suppose that they did every trick they could to have the most optimal mutex implementation, or I should invest some time any try to 'beat' std::mutex. – Broothy Jul 07 '22 at 06:07
  • @Broothy: I got curious and tested it with this source (https://godbolt.org/z/MW7WTPKfe) on my Linux machine. wait makes 4x `sched_yield()` system calls before calling `futex(FUTEX_WAIT)`. Average CPU usage is essentially zero (task-clock = 1.7ms after 19 seconds), making 5 system calls per second in the waiter and 1 `clock_nanosleep` in the notifier, and not much spinning. My hardware doesn't have `umonitor` / `umwait` (Tremont or Alder Lake), but even if it were available I doubt it would use it for long. Probably only as long as it would have spent spinning with `pause` and a normal load. – Peter Cordes Jul 07 '22 at 06:23
  • The kernel can of course use privileged `monitor`/`mwait` if it decides to sleep for the `futex` syscall; that's presumably part of why `futex` takes an address to monitor. – Peter Cordes Jul 07 '22 at 06:25
  • 1
    @Broothy and user318... - updated my answer with test code and results for what system calls we see on Linux. You could single-step into it with GDB if you want to see what actual asm instructions run. – Peter Cordes Jul 07 '22 at 07:20