33

I'm trying to make a C++ API (for Linux and Solaris) thread-safe, so that its functions can be called from different threads without breaking internal data structures. In my current approach I'm using pthread mutexes to protect all accesses to member variables. This means that a simple getter function now locks and unlocks a mutex, and I'm worried about the overhead of this, especially as the API will mostly be used in single-threaded apps where any mutex locking seems like pure overhead.

So, I'd like to ask:

  • do you have any experience with performance of single-threaded apps that use locking versus those that don't?
  • how expensive are these lock/unlock calls, compared to e.g. a simple "return this->isActive" access for a bool member variable?
  • do you know better ways to protect such variable accesses?
  • 1
    Thanks a lot for the many answers! Some clarifications: the current API has restrictions similar to OpenGL (objects can only be used/manipulated from the same thread where they were created), and I want to remove that restriction. Good point about the "Accessor" methods - I'll keep that in mind for future APIs, but the current API can't be changed in that way. There are only a few public methods, so I'm going to add locking to all of them (most are pretty high-level). And yes, I am new to multithreading :-) –  Aug 17 '09 at 08:45

9 Answers

40

All modern thread implementations can handle an uncontended mutex lock entirely in user space (with just a couple of machine instructions); only when there is contention does the library have to call into the kernel.

Another point to consider is that if an application doesn't explicitly link against the pthread library (because it's a single-threaded application), it will only get dummy pthread functions (which don't do any locking at all); only if the application is multi-threaded (and links against the pthread library) will the full pthread functions be used.

And finally, as others have already pointed out, there is no point in protecting a getter method for something like isActive with a mutex - once the caller gets a chance to look at the return value, the value might already have been changed (as the mutex is only locked inside the getter method).

cmeerw
  • 21
    the last paragraph is not true: if you don't use a mutex in a getter, the caller may get an invalid value, because a second thread could have overwritten the source value _partly_ while creating a copy. (Ok, maybe on Bools it works) – nob Jan 29 '11 at 14:46
  • 8
    While not 100% portable, I assume that any regular primitive(int, bool, or float) can be read atomically because it's guaranteed by most platforms. I'm backed up by glibc's manual which says "In practice, you can assume that int and other integer types no longer than int are atomic. You can also assume that pointer types are atomic; that is very convenient. Both of these are true on all of the machines that the GNU C library supports, and on all POSIX systems we know of." http://www.cs.utah.edu/dept/old/texinfo/glibc-manual-0.02/library_21.html#SEC360 – Gabe May 25 '11 at 21:53
  • 4
    It is true that you can read booleans atomically - the issue is that without the mutex, if the bool is updated in one thread and then later read in another, the second read might get an old, stale value from the processor's data cache. Taking the lock forces the processor to invalidate the data cache for that thread. – Tom Swirly Jan 10 '12 at 04:21
  • 3
    The matter is deep and depends upon many parameters of the architecture and the compiler. Which types are written atomically by the processor? What alignment can guarantee atomicity on the processor bus and the cache hierarchy? What is the cache coherency protocol? What is the memory consistency model implemented by the architecture? What is the memory consistency model assumed by the language/compiler? Generally, only under very strong assumptions can reads and/or writes to primitive types go unlocked -- and best left to people with intimate knowledge of their hardware and low-level software. – nccc Aug 22 '12 at 02:52
23

"A mutex requires an OS context switch. That is fairly expensive. "

  • This is not true on Linux, where mutexes are implemented using something called futexes. Acquiring an uncontended (i.e., not already locked) mutex is, as cmeerw points out, a matter of a few simple instructions, and typically takes around 25 nanoseconds on current hardware.

For more info: Futex

Numbers everybody should know

BillT
7

This is a bit off-topic, but you seem to be new to threading, so: for one thing, only lock where threads can overlap, and try to minimize those places. Also, instead of trying to lock every method, think of what a thread is doing with an object overall, make that a single call, and lock that. Try to get your locks as high up as possible (this again increases efficiency and may /help/ to avoid deadlocking). But locks don't 'compose'; you have to mentally organize your code around where the threads run and where they overlap.

JDonner
  • 1
    Great job getting to root of the problem. Encapsulate everything as much as possible when they are touched by multiple threads. Accessors weaken encapsulation. – deft_code Aug 14 '09 at 17:23
4

I wrote a similar library and didn't have any trouble with lock performance. (I can't tell you exactly how the locks are implemented there, so I can't say conclusively that it's not a big deal.)

I'd go for getting it right first (i.e. use locks) then worry about performance. I don't know of a better way; that's what mutexes were built for.

An alternative for single-threaded clients would be to use the preprocessor to build non-locking and locking versions of your library. E.g.:

#ifdef BUILD_SINGLE_THREAD
    inline void lock () {}
    inline void unlock () {}
#else
    inline void lock () { doSomethingReal(); }
    inline void unlock () { doSomethingElseReal(); }
#endif

Of course, that adds an additional build to maintain, as you'd distribute both single and multithread versions.

Peter Cardona
  • 1
    using locks doesn't guarantee "getting it right". It might lead to deadlocks all over the place. – jalf Aug 14 '09 at 12:49
  • 1
    If the library in question doesn't call back into the user and doesn't lock anything else, then deadlocks are not a possibility. – caf Aug 14 '09 at 13:59
  • 1
    I would strongly recommend against separate threaded and non-threaded versions of your library. What will happen is that writers of apps or libraries that don't use threads themselves will want to link the non-threaded version, then as soon as they pull in a dependency on another library that uses the threaded version, all hell might break loose. – R.. GitHub STOP HELPING ICE Dec 08 '10 at 13:19
3

I can tell you from Windows that a mutex is a kernel object and as such incurs a (relatively) significant locking overhead. A better-performing lock, when all you need is one that works between threads, is a critical section. This would not work across processes, just between the threads of a single process.

However, Linux is quite a different beast when it comes to multi-process locking. I know that a Linux mutex is implemented using atomic CPU instructions and applies only to a single process - so it has the same performance as a Win32 critical section, i.e. it is very fast.

Of course, the fastest locking is not to have any at all, or to use as little of it as possible. But if your lib is to be used in a heavily threaded environment, you will want to hold each lock for as short a time as possible: lock, do something, unlock, do something else, then lock again is better than holding the lock across the whole task. The cost of locking isn't the time taken to lock, but the time a thread sits around twiddling its thumbs waiting for another thread to release a lock it wants!

gbjbaanb
2

A mutex requires an OS context switch. That is fairly expensive. The CPU can still do it hundreds of thousands of times per second without too much trouble, but it is a lot more expensive than not having the mutex there. Putting it on every variable access is probably overkill.

It also probably is not what you want. This kind of brute-force locking tends to lead to deadlocks.

do you know better ways to protect such variable accesses?

Design your application so that as little data as possible is shared. Some sections of code should be synchronized, probably with a mutex, but only those that are actually necessary. And typically not individual variable accesses, but tasks containing groups of variable accesses that must be performed atomically. (perhaps you need to set your is_active flag along with some other modifications. Does it make sense to set that flag and make no further changes to the object?)

jalf
  • 1
    Yeah, that's a good point, no need to lock data that's not exposed. My strategy is almost always: 1. all data is private 2. all public/protected methods that aren't static lock on the way in 3. locks are objects on the stack so that uncaught exceptions unlock the data. – Peter Cardona Aug 14 '09 at 12:46
  • are you saying to lock in *all* public/protected methods? That's silly. Lock in the ones it makes sense to call from other threads, and nothing else. – jalf Aug 14 '09 at 13:26
  • 3
    The only way to get correct behavior in a multithreaded library is to expect users of the library to use it correctly. Spamming locks at everything that people *can* misuse doesn't solve the problem or ensure correctness. Let the user know what may, and what may not, be called from other threads, and then lock *those* methods only. – jalf Aug 14 '09 at 13:28
  • When you say "called from other threads", do you mean "called concurrently from multiple threads"? Restrictions on concurrency are a different thing from restrictions on context - an object with something to do with OpenGL might for example "not be callable from other threads" than the one to which the OpenGL context is bound. This doesn't mean it needs external synchronisation, it means you can't use it at all except from one particular thread. – Steve Jessop Aug 14 '09 at 14:29
  • @jalf I'm assuming that all public/protected methods on the interface are for use by the library client. If a method is for internal use, it should be hidden from the client. If a method isn't using class fields, it can be static and doesn't need a lock. There's no point to a client having access to methods it doesn't need or could misuse. It's simple enough to make those private or implement them in a class that's not exposed to the client system and wouldn't need additional locking. It's easy to expect users to use the lib correctly if that's their only option. :-) Nobody reads the docs. – Peter Cardona Aug 14 '09 at 14:51
  • 1
    ugh, context switch. Don't use process-shared mutexes on Windows. They are hella slow. A context switch causes cache thrashing as well as burning CPU time. Use a critical section: it uses processor atomics to check the lock and only goes to the OS when there is lock contention. This is what Win32-Pthreads does unless you specifically ask for a process-shared mutex. So mutexes are fast - usually only a few instructions (one) in the common case. On Linux mutexes are always really fast, shared or not. – deft_code Aug 14 '09 at 17:34
  • 3
    Actually... IIRC, on recent linuxes (which is half of the question) I think pthread_mutex_lock does not require a context switch most of the times as they are implemented around futexes (see http://en.wikipedia.org/wiki/Futex ) by default. – Fredrik Aug 14 '09 at 20:48
  • Sorry - I'm coming from an RTOS background, not Linux, so the answer might be different... but if the mutex is available, will it still require a context switch? Under every RTOS I know, there won't be a context switch (actually 2) unless the mutex is currently unavailable. – Dan Aug 15 '09 at 05:57
  • 1
    If a mutex requires a context switch, your OS is beyond unusably broken for concurrent programming. Linux definitely does not require context switches, for either process-local or process-shared mutexes, except in the case of contention (and even then it's just to avoid much-more-expensive spinning in userspace). – R.. GitHub STOP HELPING ICE Dec 08 '10 at 13:20
  • 1
    True. Not sure what I was thinking when I wrote this answer. A mutex *might* imply a context switch (if there is contention). Otherwise it costs a couple of atomic instructions (which are still more expensive than not having the mutex there at all, but nowhere near as expensive as a context switch. ) – jalf Dec 08 '10 at 18:29
2

I was curious about the expense of pthread_mutex_lock/unlock. I had a scenario where I needed to either copy anywhere from 1500-65K bytes without using a mutex, or use a mutex and do a single write of a pointer to the needed data.

I wrote a short loop to test each approach:

gettimeofday(&starttime, NULL);
/* COPY DATA */
gettimeofday(&endtime, NULL);
timersub(&endtime, &starttime, &timediff);
/* print out timediff data */

or

gettimeofday(&starttime, NULL);
pthread_mutex_lock(&mutex);
gettimeofday(&endtime, NULL);
pthread_mutex_unlock(&mutex);
timersub(&endtime, &starttime, &timediff);
/* print out timediff data */

If I was copying less than 4000 or so bytes, then the straight copy operation took less time. If however I was copying more than 4000 bytes, then it was less costly to do the mutex lock/unlock.

The mutex lock/unlock took between 3 and 5 usec, including the roughly 2 usec spent in the gettimeofday call for the current time.

user413894
  • Which operating system and version, and which processor were you using for this test? What was the granularity of the system's clock, which clock did you use (wall/CPU/thread), and how often did you repeat the operations? Did you repeat at all, or did you take a single measurement? Because on Linux 3.2 x64, a single core of my machine can perform a maximum of ~45 million `pthread_mutex_lock()`/`_unlock()` pairs per second. And it's not exactly a high-end system. But this time obviously includes caching and user-space handling benefits. – class stacker Nov 23 '15 at 08:01
1

For member variable access, you should use read/write locks, which have slightly less overhead and allow multiple concurrent reads without blocking.

In many cases you can use atomic builtins, if your compiler provides them (with gcc or icc, __sync_fetch*() and the like), but they are notoriously hard to handle correctly.

If you can guarantee the access is atomic (for example, on x86 a dword read or write is always atomic if it is aligned, but a read-modify-write is not), you can often avoid locks altogether and use volatile instead, but this is non-portable and requires knowledge of the hardware.

Gunther Piez
  • 1
    You can never guarantee a read or write is atomic on a multicore system without processor help, i.e. __sync_fetch*. On a single-core machine it is safe to assume machine-word reads/writes are atomic, but single cores are going the way of the dodo. – deft_code Aug 14 '09 at 17:22
  • 1
    @Caspin, _reading_ a dword is definitely an atomic operation in a sense that you're never going to get a mix of some "old" bits and some "new" bits even if another core is writing to the same memory location at the same time (but, for example, a qword is _not_ atomic on x86 in the same sense). I think you're confusing "atomic" with "ordered". – Pavel Minaev Aug 14 '09 at 20:48
  • 1
    @Caspin, you are wrong. Quoting from Intel® 64 Architecture Memory Ordering White Paper: "Intel 64 memory ordering guarantees that for each of the following memory-access instructions, the constituent memory operation appears to execute as a single memory access regardless of memory type: ... 3. Instructions that read or write a doubleword (4 bytes) whose address is aligned on a 4 byte boundary. 4. Instructions that read or write a quadword (8 bytes) whose address is aligned on an 8 byte boundary." So, among lots of others, this is atomic. BTW, Intel 64 refers to both IA-32 and 64. – Gunther Piez Aug 15 '09 at 07:20
  • I have compared rwlocks with mutexes and I completely disagree that they have less overhead. Also, in practice, they tend to perform worse than one thinks. 1) Modern OSs have efficient mutex support, at least for threads in the same process, built on atomic CPU instructions in user land. Rwlocks are built on top of that, which makes them less efficient. 2) If write operations become necessary on a regular basis and take longer than reads, rwlocks don't provide any advantage. – class stacker Nov 23 '15 at 12:12
0

Well, a suboptimal but simple approach is to place macros around your mutex lock and unlock calls, then have a compiler/makefile option to enable or disable threading.

Ex.

#ifdef THREAD_ENABLED
#define pthread_mutex_lock(x) ... //actual mutex call
#else
#define pthread_mutex_lock(x) ... //do nothing
#endif

Then, when compiling, pass gcc -DTHREAD_ENABLED to enable threading.

Again, I would NOT use this method in any large project - only if you want something fairly simple.

mox1
  • 1
    I would not recommend this for the same reason I told the other person with this answer, and since using `#define pthread_mutex_lock` after including `pthread.h` results in undefined behavior (you could skip `pthread.h` entirely though if possible). – R.. GitHub STOP HELPING ICE Dec 08 '10 at 13:26