Why is a pthread mutex considered "slower" than a futex?

Question

Why are POSIX mutexes considered heavier or slower than futexes? Where is the overhead coming from in the pthread mutex type? I've heard that pthread mutexes are based on futexes, and when uncontested, do not make any calls into the kernel. It seems then that a pthread mutex is merely a "wrapper" around a futex.

Is the overhead simply in the function-wrapper call and the need for the mutex function to "setup" the futex (i.e., basically the setup of the stack for the pthread mutex function call)? Or are there some extra memory barrier steps taking place with the pthread mutex?

@Jörgen you didn't miss anything back then, they didn't exist! They're found in Linux 2.6.x (developed during 2.5.x development series) — Nektarios, Jun 15 '11 at 21:40
@Nektarios: actually, similar kinds of locks did exist much earlier. I believe the original DRI lock (around '91, SGI) was similar to current futexes. — ninjalj, Jun 15 '11 at 22:13
Do you have a reference for "POSIX mutexes considered heavier or slower than futexes"? Because as far as I know, for the past several years (since NPTL) pthreads on Linux have worked as you describe. — Nemo, Jun 15 '11 at 22:19
@Nemo: Just curious, if they work as I've described in my question (i.e., both remain in user-space when uncontested, and both make kernel-calls when contested), then why go through the trouble of using a futex over a mutex? — Jason, Jun 15 '11 at 23:13
@Nemo: BTW, it seems like on [this thread](http://stackoverflow.com/questions/6089917/how-to-achieve-lock-free-but-blocking-behavior/6089975#6089975) people really like the fact I answered by suggesting a futex over a mutex, since my assumption was a mutex must aways, even in the uncontended case, make a call into the kernel. I noticed though that some posters were mentioning that on Linux, pthread mutexes and semaphores are wrappers on futexes, which then begs the question why deal with the trouble of using a futex when higher-level abstractions are readily available? — Jason, Jun 15 '11 at 23:21
@Jason: As far as I know, there is no reason. A "futex" is just a Linux primitive to allow efficient implementation of mutexes and condition variables. Now that those efficient implementations are part of the standard library, there is no reason to use futexes directly as far as I know. — Nemo, Jun 15 '11 at 23:54
Any modern, sane implementation of pthread mutexes will use an underlying mechanism that's either a futex or something roughly equivalent. Thus the "mutex or futex" way you framed your question has a false assumption built into it. — R.. GitHub STOP HELPING ICE, Jun 16 '11 at 02:13
@R..: Thanks for the info ... while there might be a false assumption built into the question, I've seen other threads on S.O. draw a distinction between the two, so hearing from some sources that there was no distinction, and from other sources that there was a distinction, left me puzzled. — Jason, Jun 16 '11 at 02:19
The distinction is not between mutex and futex. The distinction is between a mutex implemented with futexes, and a mutex implemented with a userspace wait queue and signals (the old LinuxThreads way). I suppose you could also consider mutexes with no fast userspace path, like Windows' mutexes that *always* involve a syscall, but they're such a joke nobody has ever used them for implementing POSIX mutexes, to my knowledge... — R.. GitHub STOP HELPING ICE, Jun 16 '11 at 02:41
@R.. I believe that the Windows version of pthreads uses Win32 mutexes, as well as the boost threads and C++0x threads when compiled and run under windows. Please correct me if I'm wrong ... — Jason, Jun 16 '11 at 14:28
The real issue here is spotting a sane implementation of pthreads which would be "modern" enough to be using futexes to implement a mutex. — Michael Foukarakis, Sep 17 '11 at 04:07

ninjalj · Answer 1 · 2011-07-15T19:42:42.600

34

Futexes were created to improve the performance of pthread mutexes. NPTL uses futexes, LinuxThreads predated futexes, which I think is where the "slower" consideration comes. NPTL mutexes may have some additional overhead, but it shouldn't be much.

Edit: The actual overhead basically consists on:

selecting the correct algorithm for the mutex type (normal, recursive, adaptive, error-checking; normal, robust, priority-inheritance, priority-protected), where the code heavily hints to the compiler that we are likely using a normal mutex (so it should convey that to the CPU's branch prediction logic),
and a write of the current owner of the mutex if we manage to take it which should normally be fast, since it resides in the same cache-line as the actual lock which we have just taken, unless the lock is heavily contended and some other CPU accessed the lock between the time we took it and when we attempted to write the owner (this write is unneeded for normal mutexes, but needed for error-checking and recursive mutexes).

So, a few cycles (typical case) to a few cycles + a branch misprediction + an additional cache miss (very worst case).

edited Jul 15 '11 at 19:42

answered Jun 15 '11 at 22:16

ninjalj

42,493
9
106
148

So then is there a point to using a futex over a mutex (at least on Linux)? – Jason Jun 15 '11 at 23:16
2

@Jason: not really, unless you are writing your own libc, or want to create some synchronization primitive other than a mutex. – ninjalj Jun 16 '11 at 00:09
6

@Jason, seen just as a replacement of mutex, futex doesn't bring you much difference in performance. In contrast it has an API that is much more difficult to capture, so don't do that. Where futex can really be helpful is when it is viewed as a mutex *and* condition variable (on an `int` condition) in one go. There it is more time and space efficient and you avoid the pitfall of POSIX that someone could wait on a condition by using a different mutex. – Jens Gustedt Jun 16 '11 at 06:53
So, in short, when to use which ? – Eric Jan 22 '16 at 10:16

score 16 · Answer 2 · answered Nov 29 '16 at 19:55

16

Technically speaking pthread mutexes are not slower or faster than futexes. pthread is just a standard API, so whether they are slow or fast depends on the implementation of that API.

Specifically in Linux pthread mutexes are implemented as futexes and are therefore fast. Actually, you don't want to use the futex API itself as it is very hard to use, does not have the appropriate wrapper functions in glibc and requires coding in assembly which would be non portable. Fortunately for us the glibc maintainers already coded all of this for us under the hood of the pthread mutex API.

Now, because most operating systems did not implement futexes then programmers usually mean by pthread mutex is the performance you get from usual implementation of pthread mutexes, which is, slower.

So it's a statistical fact that in most operating systems that are POSIX compliant the pthread mutex is implemented in kernel space and is slower than a futex. In Linux they have the same performance. It could be that there are other operating systems where pthread mutexes are implemented in user space (in the uncontended case) and therefore have better performance but I am only aware of Linux at this point.

answered Nov 29 '16 at 19:55

Mark Veltzer

1,394
15
20

Is it "a statistical fact that in most operating systems that are POSIX compliant the pthread mutex is implemented in kernel space"? Really? Because I'd have thought it likely the precise opposite was the case — it's obvious that you'd want to avoid kernel transitions if possible, so I'd expect the authors of typical pthreads implementations to try to stay in user land as much as possible. – al45tair Jul 16 '18 at 17:17
It is impossible to always stay in user space when implementing a mutex because: - a thread that has to wait on the mutex must put itself to sleep (real sleep, not busy wait) and this is impossible without the OS. - a thread that exists a mutex has to check if there are other threads waiting on the mutex it is releasing and so has to wake them up. This is also impossible without the OS. – Mark Veltzer Aug 02 '18 at 14:45
This means that you can only stay in user space in the "uncontended" case which is when: - you reach a mutex and it is open so you don't have to go to sleep. - you release a mutex and find that no one is waiting for it and so you do not need to wake up anyone. – Mark Veltzer Aug 02 '18 at 14:49
And now for the crux: sure, most pthread implementors would like to stay in user space as much as possible but the implementation details are very difficult and require *special APIs from the OS*. This was missing in most standard UNIX systems. – Mark Veltzer Aug 02 '18 at 14:49
I'm aware that it's generally impossible to avoid a kernel transition to actually put the thread to sleep, but I think the view that "special" APIs are required is an exaggeration. I think it comes about because older kernels (not just Linux but generally) didn't support threads at all, and so a lot of the problems associated with trying to write a threading package on a kernel with no thread support have been misconstrued as inherent issues with threads. – al45tair Aug 03 '18 at 10:55
No. That's actually a mistake. For instance: you want to know if a mutex is currently held or not. How do you do that? User space only has a id of a mutex but all the state of the mutex is held in kernel space and kernel space is invisible from user space. If the state of the mutex is exposed to user space then user space may corrupt the state of the mutex and cause the kernel to crash (clear violation of userspace/kernel relations and OS definition). So new APIs are required. – Mark Veltzer Aug 04 '18 at 21:54
Please checkout "man 2 futex" and tell me if this is not a special API (btw: one of the hardest API to use and it's even hard to understand how this API has anything to do with futexes....). – Mark Veltzer Aug 04 '18 at 21:54
No, it isn't a mistake. It *may* be true that a futex-style API is an efficient way to do it, but e.g. up until 10.10 or so, the macOS pthread_mutex implementation used a Mach semaphore to block and kept its state in userland. I totally agree that `futex` and the `__psync_mutexdrop/wait` APIs in the current macOS are special APIs, FWIW, but I don't see any argument you've put forward that prevents the kind of approach seen in the macOS prior to 10.10. – al45tair Aug 05 '18 at 18:00
I highly doubt that the state of the semaphore was kept in userland. You mean that I could fiddle with the list of waiting processes for a semaphore? Take some off the list? Add ones that didn't want to wait for the semaphore? Please explain. Even today with futexes there was a need to carefully divide the state of the mutex between userland and kernelspace to avoid really bad race conditions and attacks... – Mark Veltzer Aug 06 '18 at 10:47
I didn't say it was. AFAIK the semaphore was only used for blocking; the *mutex* had its state held in user land. The implementation is quite complicated, but as I understand it it wouldn't touch the semaphore if the mutex was uncontended (i.e. no waiters). If you're interested, you can see it e.g. here: https://opensource.apple.com/source/Libc/Libc-498.1.7/pthreads/pthread_mutex.c.auto.html Newer versions changed the way it works internally by adding special syscalls. – al45tair Aug 06 '18 at 13:17

score 16 · Answer 3 · answered Sep 17 '11 at 01:48

The short answer to your question is that futexes are known to be implemented about as efficiently as possible, while a pthread mutex may or may not be. At minimum, a pthread mutex has overhead associated with determining the type of mutex and futexes do not. So a futex will almost always be at least as efficient as a pthread mutex, until and unless someone thinks up some structure lighter than a futex and then releases a pthreads implementation that uses that for its default mutex.

score 11 · Accepted Answer · answered Jun 15 '11 at 21:14

11

Because they stay in userspace as much as possible, which means they require fewer system calls, which is inherently faster because the context switch between user and kernel mode is expensive.

I assume you're talking about kernel threads when you talk about POSIX threads. It's entirely possible to have an entirely userspace implementation of POSIX threads which require no system calls but have other issues of their own.

My understanding is that a futex is halfway between a kernel POSIX thread and a userspace POSIX thread.

answered Jun 15 '11 at 21:14

Nektarios

10,173
8
63
93

1

The futex though must still make a system call if it is contested. If both the futex and mutex stay in user-space when uncontested, what are the "extra" system-calls the mutex must make over the futex when it's contested? Are you saying that the overhead for the mutex is in how it handles the contested case (i.e. more complex kernel calls compared to a futex)? – Jason Jun 15 '11 at 21:16
4

A kernel mutex doesn't stay in userspace when uncontested, it goes to kernel mode. Any thread operation in a kernel implementation of POSIX threads goes directly to kernel mode because there's no userspace portion of the implementation. – Nektarios Jun 15 '11 at 21:19
2

To make things more complicated (and more clear I hope), just because you're using 'pthreads' or POSIX threads doesn't mean you're using kernel or usermode implementations. In fact I don't know how to determine that except for source diving or experimental observation of their behavior. – Nektarios Jun 15 '11 at 21:27
1

I too was originally under the assumption that a mutex for a kernel-thread, even in the uncontested case, always made a kernel call, but I'm being told that on Linux, that is not true since Linux pthread mutexes and semaphores are wrappers around futexes which remain in user-space in the uncontested case. So that's what is leaving me a bit confused ... I don't see why we should go through the trouble of dealing with raw futexes when mutexes are simple to use and if they're based on futexes, should exhibit the exact same performance characteristics. – Jason Jun 15 '11 at 23:24
@Jason - in the case of an NPTL implementation of POSIX threads, if they are simply wrapping a futex, then you're right, there's no reason to deal with raw futexes because you're (presumably) going to get the exact same performance as you would get by wrapping a futex yourself. My discussion generally relates to kernel-level POSIX threads which is what your original question indicated to me. When you say "POSIX mutexes" realize that can mean kernel, userland, futex-based, or potentially other implementations of the POSIX API (remember POSIX is an API not an implementation detail) – Nektarios Jun 15 '11 at 23:40
1

'kernel' mutexes don't exist in Linux, only futexes – Spudd86 Sep 17 '11 at 02:14
DIY futex is problematic for one more reason. We really want to use the "robust futex" mechanism (set_robust_list) to avoid deadlock when a process dies holding a futex. However the Linux futex API provides only one "robust futex list" per thread, and that is already used by the pthread code in the glibc library. So we don't have a lot of choice here. We're pretty much compelled to use the pthread API. In reality, that's no problem since pthread is just "an implementation of robust futex" and it runs (according to my measurements) just as fast as a DIY futex. – the.jxc May 24 '23 at 22:17

score 1 · Answer 5 · answered Jul 10 '16 at 14:40

1

On AMD64 a futex is 4 bytes, while a NPTL pthread_mutex_t is 56 bytes! Yes, there is a significant overhead.

answered Jul 10 '16 at 14:40

frankencode

51
2

5

I think the question was about run time performance and not structure size. Besides, you need more data in user space and there is more data in kernel space for a futex so a futex is not 4 bytes, not by a long shot.... – Mark Veltzer Nov 29 '16 at 19:55

Why is a pthread mutex considered "slower" than a futex?

5 Answers5