Performance of std::mutex and QMutex in MinGW 64 (posix thread version)

Question

I have tried to replace QMutex in my application (monte carlo simulation) by std::mutex, and surprisingly, the computation speed was divided by 3. The mutex locking/unlocking performance cost rised from negligible to about 66% of the threads time.

I digged into implementations sources. I initially thought that both were wrappers to Win32 threads (with an extra layer of pthread for std::thread), but in fact Qt is not using any kernel functions for the mutexes, and has its own internal implementation based on atomic variables. This one seems much faster.

Is it really possible to fully implement a mutex with only atomic variables? Does it have limitations?
Why is this not used in the STL?
What would be the guidelines for fully avoiding mutexes in my case? I am accumulating (additive operations) values over floating point buffers, shared between multiple threads

Thanks

score 10 · Accepted Answer · answered Mar 20 '15 at 21:05

Qt is not using any kernel functions for the mutexes, and has its own internal implementation based on atomic variables

Of course Qt calls the OS to make the thread wait if the mutex is already locked.

The idea here is that in good multithreaded code the chance of waiting on a already locked mutex are extremely low; the common case to optimize is the uncontended mutex, which must be kept as fast as possible.

Hence, QMutex is designed around the Linux futex system call, and to use atomics and lockfree programming. In the uncontended case, no system calls are necessary, only some clever atomics programming; in the contended case, system calls are necessary (to wait/wake threads), and indeed that's what Qt uses:

pthread_mutex_* / pthread_cond_* on a generic Unix
futex on Linux
WaitForSingleObjectEx on Windows
semaphore_* on Mac

For more info behind QMutex design, see here.

Why does the STL not use a similar approach? I have no idea. Possibly because there's native_handle function which should return "something" in all cases, even when the mutex is unlocked or locked but uncontended, so it always uses the system calls.

(Anyhow, I'm not surprised that Qt outperforms std::mutex. If it didn't, QMutex would be a wrapper around std::mutex.)

score 1 · Answer 2 · edited May 23 '17 at 12:02

1

It sounds like QMutex is implemented with a spinlock. To answer your questions:

Yes, it's possible. It has limitations, see this answer for instance.
With just atomics you can write a spinlock; you can't write a real mutex from anything else in the standard library.
Without knowing what your algorithm does it's hard to say. But it looks like you could benefit from reading this book: Is Parallel Programming Hard, And, If So, What Can You Do About It?. The Counting chapter in particular, seems to roughly correspond to what you described. TL;DR you will have to try out the different algorithms in that chapter to see which one is better for your particular case.

edited May 23 '17 at 12:02

Community

1
1

answered Mar 20 '15 at 20:47

DanielKO

4,422
19
29

In fact, peppe is right, Qt just use an atomic flag for speeding up the mutex lock when there is no concurrent lock. Otherwise it uses a system mutex – galinette Mar 22 '15 at 10:39
I can't use the lock-free counting approach, as this is a floating-point addition. Unfortunately, there are no atomic floats. – galinette Oct 31 '15 at 14:41

score 1 · Answer 3 · answered Apr 17 '19 at 13:59

1

Funny I came here trying to figure out why QMutex was so slow. At least for the heavily contented case wait for single object appears pretty slow compared to enterCriticalSection. TBB wraps that and appears to be much faster than QMutex on windows.

answered Apr 17 '19 at 13:59

Gedalia

443
2
5

Performance of std::mutex and QMutex in MinGW 64 (posix thread version)

3 Answers3