
I have a function that modifies a shared resource in my multi-threaded program. It is the only place where the threads touch the shared resource, and it accounts for only a small fraction of each thread's overall work.

static int64_t
AddToSharedResource(volatile int64_t* value, int64_t to_add)
{
    int64_t result = *value;  // remember the previous value
    *value += to_add;         // read-modify-write: not atomic on its own
    return result;
}

I wanted to make my application thread-safe, so I wrapped the read and the write in a simple mutex lock.

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static int64_t
AddToSharedResource(volatile int64_t* value, int64_t to_add)
{
    pthread_mutex_lock(&lock);
    int64_t result = *value;
    *value += to_add;
    pthread_mutex_unlock(&lock);
    return result;
}

Doing so makes my program more than 10x slower, even slower than the single-threaded version!

After reading up a bit more, this seems to be due to macOS's mutex implementation, which uses "fair" mutexes rather than spinlocks. There are trade-offs between the two approaches, and this access pattern is one of the cases where fair mutexes perform badly. However, I wrote the code this way because I had already written the program against Win32 (where the lock caused barely any performance penalty), and I'm planning to port the function to Linux as well.

Is there a way to make this function thread-safe on macOS without creating a huge bottleneck, or do I need to redesign the platform layer?

Ted Klein Bergman
  • How your resources are placed in memory matters **a lot**. I managed to speed up a threaded program x10 by avoiding [false sharing](https://en.wikipedia.org/wiki/False_sharing) - and not making any other changes. `alignas` and [std::hardware_destructive_interference_size](https://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_size) are your friends in that case (see the alignment sketch after these comments). – Ted Lyngmo Jul 11 '19 at 16:41
  • @TedLyngmo That might help. There are currently three `int64_t` stored in a single instance of a struct (i.e. they're stored sequentially). Thanks for the tips! Will try it out – Ted Klein Bergman Jul 11 '19 at 16:49
  • You're welcome. I've noticed that the effect is different on different platforms. On Windows it was extreme, but on Linux I didn't quite get the same effect. So, perhaps it's not a solution for you, but it won't hurt (except that it eats more memory). – Ted Lyngmo Jul 11 '19 at 16:51
  • `AddToSharedResource(volatile int64_t* value, int64_t to_add)` - Why is the pointer `volatile`? Note that `volatile` does *not* mean thread safe. – Jesper Juhl Jul 11 '19 at 16:53
  • I have to ask the stupid question: are you building with optimization enabled? If not, do that as step 1 (on all your platforms). – Jesper Juhl Jul 11 '19 at 16:55
  • @JesperJuhl I know, it's volatile because the value I'm passing into the function is the shared resource. If I understand correctly, `volatile` means that the compiler will not cache the value in a register, but instead re-read it from memory every time. Hence, the shared resource should be `volatile`? – Ted Klein Bergman Jul 11 '19 at 16:55
  • @JesperJuhl Yes! Tried both `-Og` and `-O2` – Ted Klein Bergman Jul 11 '19 at 16:56
  • No, if you sync access to your resources properly, you never need `volatile`. Volatile is for things changing outside of the compiler's control, like a memory address mapped directly to some hardware signal. – Ted Lyngmo Jul 11 '19 at 16:57
  • That's not what `volatile` is for. It's mainly for reading from hardware registers and the like. The main effect you'll get here is to prevent the compiler from optimizing access to the pointer. You are just making your code run slower. – Jesper Juhl Jul 11 '19 at 16:59
  • So, if `volatile` was added when you made it multithreaded, removing it might be a lifesaver. – Ted Lyngmo Jul 11 '19 at 17:00
  • @TedLyngmo [This](https://en.wikipedia.org/wiki/Volatile_(computer_programming)) suggests that it's used to prevent the optimizer from optimizing away *"reads or writes and thus incorrectly reusing a stale value or omitting writes. [...] in threading".* I'll try to remove it though to see if it works better – Ted Klein Bergman Jul 11 '19 at 17:03
  • That's perfectly valid, but you don't need that. You will not reuse a stale value or omit writes when using the proper syncing mechanisms. And note that it prevents the optimizer from doing what it's good at :-) – Ted Lyngmo Jul 11 '19 at 17:05
  • @JesperJuhl @TedLyngmo I tried to remove `volatile`, but the behavior is still unchanged, so I doubt that's the culprit. But thanks for the suggestion; I was pretty sure of my understanding of `volatile`, but I think I'll have to look more into it then – Ted Klein Bergman Jul 11 '19 at 17:07
  • It'll be baby steps to get it up to single-threaded speed and beyond, I guess, but removing hurdles on the way is always good. Side note: I assume your compiler is C++11 compliant. Go for `<mutex>` to get `RAII` locking etc. It's much simpler and doesn't have as many pitfalls (a `std::lock_guard` sketch follows after these comments). – Ted Lyngmo Jul 11 '19 at 17:09
  • BTW, your understanding of `volatile` was roughly correct for single-core processors with simple memory/cache hardware, because compiler optimizations were all you had to worry about. It is exactly _incorrect_ for all modern hardware, and even in the situations where it worked, it was a non-standard non-portable hack. – Useless Jul 11 '19 at 17:14
  • `volatile register int* foo` ... and it compiled ... – Ted Lyngmo Jul 11 '19 at 17:18
  • @TedLyngmo Will probably do that! Although most likely after I've visited every pitfall within each platform's own API :P Just for funsies! – Ted Klein Bergman Jul 11 '19 at 17:23
  • :-D that's the spirit! – Ted Lyngmo Jul 11 '19 at 17:24
  • Compare your results with [for_each](https://en.cppreference.com/w/cpp/algorithm/for_each)`(ExecutionPolicy...` (sketched after these comments). It's a high level of abstraction from doing it yourself (C++17). Yet again, not the same benefit on all platforms, but I was really surprised when I fed it data that didn't need syncing and was separated by `std::hardware_destructive_interference_size`. Silly-speed. I couldn't match it with a handcrafted version trying to do the same. – Ted Lyngmo Jul 11 '19 at 17:30
  • @TedLyngmo Sorry, not quite sure what you mean. What is it I should iterate over using `std::for_each`? – Ted Klein Bergman Jul 11 '19 at 17:42
  • It's for a "I have some problems to solve" kind of situation. Perhaps not suited for this, but if you know that you have a finite number of things to do, and you can separate them so they never need syncing, I'd try `for_each(ExecutionPolicy...`. – Ted Lyngmo Jul 11 '19 at 17:44
  • @TedLyngmo Aah, I see! Might give that a go! Interesting to see what the compiler can come up with :D – Ted Klein Bergman Jul 11 '19 at 17:46
  • The only drawback with the standard `ExecutionPolicy` is that none of the policies come with a terminating condition. I think (not sure) that MP lets you `throw` yourself out of a loop if you find something that makes you want to terminate. In the standard, it'll actually iterate over the "problems" twice - and afaik, terminating the loop early can't be done - unless you implement your own ExecutionPolicy. How to do that is beyond my knowledge, and it's not for a lack of searching. I created iterators to keep feeding the `for_each` with new problems - not working :-) It needs to be finite. – Ted Lyngmo Jul 11 '19 at 17:50
  • Can't you avoid the issue by making the variable thread local? – curiousguy Nov 10 '19 at 09:06
  • @Useless "_your understanding of volatile was roughly correct for single-core processors with simple memory/cache hardware_" Not even on a time-sharing single-CPU system was volatile practical for guaranteed future-proof MT-safe code. Volatile semantics are awful, and so is C/C++ as a high-level portable asm. Linux got it wrong and was burnt by C/C++'s high-level-ness several times. – curiousguy Nov 10 '19 at 09:19
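
As Ted Lyngmo describes above, false sharing can dominate a workload like this one. Below is a minimal sketch of the alignment trick, assuming C++17; std::hardware_destructive_interference_size lives in `<new>`, and the 64-byte fallback is an assumption for standard libraries that don't define it:

#include <atomic>
#include <cstddef>
#include <cstdint>
#include <new>
#include <thread>
#include <vector>

#ifdef __cpp_lib_hardware_interference_size
static constexpr std::size_t kCacheLine = std::hardware_destructive_interference_size;
#else
static constexpr std::size_t kCacheLine = 64;  // assumed cache-line size
#endif

// Padding each counter to a full cache line means threads writing to
// different counters never invalidate each other's cached lines.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<int64_t> value{0};
};

int main() {
    PaddedCounter counters[4];
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i)
        threads.emplace_back([&counters, i] {
            for (int n = 0; n < 1000000; ++n)
                counters[i].value.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& t : threads)
        t.join();
}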
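
The RAII locking from `<mutex>` mentioned above could look like this: a sketch of the question's function with std::lock_guard, assuming C++11. It doesn't change the locking cost (on Mac and Linux, std::mutex is a thin wrapper over pthread_mutex_t), but the mutex can never be left locked by mistake:

#include <cstdint>
#include <mutex>

static std::mutex lock;

static int64_t
AddToSharedResource(int64_t* value, int64_t to_add)
{
    std::lock_guard<std::mutex> guard(lock);  // unlocks automatically on return
    int64_t result = *value;
    *value += to_add;
    return result;
}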
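
And the parallel for_each could look like the sketch below, assuming C++17 `<execution>` support (GCC's libstdc++ additionally needs to link against TBB). The element type, size, and doubling operation here are made up for illustration:

#include <algorithm>
#include <cstdint>
#include <execution>
#include <vector>

int main() {
    std::vector<int64_t> work(1000000, 1);
    // Each element is an independent unit of work, so no syncing is
    // needed and the library is free to spread iterations across threads.
    std::for_each(std::execution::par, work.begin(), work.end(),
                  [](int64_t& x) { x *= 2; });
}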

2 Answers


Your example is an exact match for std::atomic::fetch_add. Atomic operations should be much cheaper than doing the lock-modify-unlock dance, and have the added benefit of letting you specify exact memory-ordering semantics.
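
A minimal sketch of what that could look like for the function in the question, assuming the shared resource can be declared as a std::atomic<int64_t> rather than a plain int64_t:

#include <atomic>
#include <cstdint>

static int64_t
AddToSharedResource(std::atomic<int64_t>* value, int64_t to_add)
{
    // Atomically performs *value += to_add and returns the value held
    // immediately before the addition, matching the original function
    // without any mutex.
    return value->fetch_add(to_add);
}

The default memory order is std::memory_order_seq_cst; a weaker order such as std::memory_order_relaxed can be passed as a second argument when the counter doesn't guard other data.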

Botje
  • Thanks for the suggestion! I'll have to look at it more closely first though. Do you happen to know the underlying instructions for macOS? I'm not that familiar with the C++ threading library; I usually make the system calls directly. I've seen that clang has the [blocks extension](https://en.wikipedia.org/wiki/Blocks_(C_language_extension))... Could it be implemented with that maybe? – Ted Klein Bergman Jul 11 '19 at 17:13
  • :-) I guess that was aimed at me? `pthreads` is what `<thread>` is based on. On Mac and Linux, it's pretty much a thin wrapper - but with proper resource management. – Ted Lyngmo Jul 11 '19 at 17:14
  • blocks looks more like std::async, and both are much higher-level than std::atomic – Useless Jul 11 '19 at 17:17
  • @TedLyngmo I was actually wondering about the `std::atomic::fetch_add`! Since macOS uses `pthreads`, it should be possible to do this operation using system calls rather than the C++ `thread`/`atomic` library. Was just curious what the underlying call was on macOS. – Ted Klein Bergman Jul 11 '19 at 17:20
  • The atomic operations map exactly to atomic CPU instructions for your architecture. – Botje Jul 11 '19 at 17:23
  • I must say I rarely use those synchronizations features at all so I think @Useless is more capable to answer that. I keep syncs to a minimum so when there's a sync, two pointers are swapped and a new block of info is available. – Ted Lyngmo Jul 11 '19 at 17:23
  • @Botje So it's an intrinsic? I remember reading about a couple of threading intrinsics, but never thought of it afterward. Or am I misremembering? – Ted Klein Bergman Jul 11 '19 at 17:27
  • Threading is something for the OS, not the CPU. But "intrinsic" is a proper name for a function that maps closely to a CPU instruction, yes. Maybe you were thinking of SIMD intrinsics? – Botje Jul 11 '19 at 17:31
  • You can compare some versions on [godbolt](https://godbolt.org/z/GZub2Q) or disassemble your own OSX code. The mutex calls might be different on OSX, but the idea is the same. – Useless Jul 11 '19 at 17:33
  • Yeah, I was probably thinking of a SIMD intrinsic that was used in some threading twitch stream. Anyway, seeing as I can just do as @Useless (thanks for the tip!) said and compare the assembly in godbolt, I'll probably figure it out! Seems like this is the way to go though. Will read up on it more and test more extensively tomorrow. – Ted Klein Bergman Jul 11 '19 at 17:38
  • @Botje I was thinking of the [`lock`](https://godbolt.org/z/JvGN7l) intrinsic! So not exactly a _"threading intrinsic"_ in that regard, but not completely off :D – Ted Klein Bergman Jul 20 '19 at 17:57
  • "_specify exact memory ordering semantics._" I don't think you can claim that these "exact" ordering semantics make sense and that there exist any C++ program that has defined behavior. See my Q [How are MT programs proven correct with “non sequential” semantics?](https://stackoverflow.com/q/58722848/963864) – curiousguy Nov 10 '19 at 09:24

It seems that std::atomic::fetch_add (as @Botje suggested) compiles down to a `lock`-prefixed instruction on my architecture, while surrounding the critical section with a mutex lock/unlock requires two actual calls down to the kernel (which seems to explain the difference in performance).

I was interested in how to generate the same instructions using macOS's API rather than C++'s standard library (just for fun). After scanning the documentation, I came across the header libkern/OSAtomic.h, which defines functions for atomic operations.

#include <libkern/OSAtomic.h>

static int64_t
AddToSharedResource(volatile int64_t* value, int64_t to_add)
{
    int64_t result = OSAtomicAdd64(to_add, value);  // returns the *new* value
    return result - to_add;  // subtract to recover the previous value
}

However, this produces a deprecation warning with the suggestion to use C++'s standard library instead.

'OSAtomicAdd64' is deprecated: deprecated in macOS 10.12 - Use 
    std::atomic_fetch_add_explicit(std::memory_order_relaxed) 
    from <atomic> instead.

So it seems even macOS wants us to use the C++ standard library. Windows, on the other hand, hasn't deprecated its corresponding InterlockedAdd64.
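
For completeness, here is a sketch of the replacement the deprecation message points at. Note that the C++ free function takes a std::atomic<int64_t>*, not a plain volatile int64_t*, so the declaration of the shared resource has to change, and that fetch-add returns the previous value directly, which makes the `- to_add` correction above unnecessary:

#include <atomic>
#include <cstdint>

static int64_t
AddToSharedResource(std::atomic<int64_t>* value, int64_t to_add)
{
    // Relaxed ordering follows the deprecation message; use a stronger
    // order if other data must be published under this counter.
    return std::atomic_fetch_add_explicit(value, to_add,
                                          std::memory_order_relaxed);
}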

Ted Klein Bergman