
I have a multithreaded scientific application where several computing threads (one per core) have to store their results in a common buffer. This requires a mutex mechanism.

Working threads spend only a small fraction of their time writing to the buffer, so the mutex is unlocked most of the time, and locks have a high probability of succeeding immediately without waiting for another thread to unlock.

Currently, I have used Qt's QMutex for the task, and it works well: the mutex has a negligible overhead.

However, I have to port it to C++11/STL only. When using std::mutex, the performance drops by 66% and the threads spend most of their time locking the mutex.

After another question, I figured out that Qt uses a fast locking mechanism based on a simple atomic flag, optimized for cases where the mutex is not already locked, and falls back to a system mutex when concurrent locking occurs.

I would like to implement this with the STL. Is there a simple way based on std::atomic and std::mutex? I have dug into Qt's code, but it seems overly complicated for my use (I do not need lock timeouts, pimpl, a small footprint, etc.).

Edit: I have tried a spinlock, but this does not work well because:

Periodically (every few seconds), another thread locks the mutexes and flushes the buffer. This takes some time, so all worker threads get blocked at that point. The spinning threads keep the scheduler busy, making the flush 10-100x slower than with a proper mutex. This is not acceptable.

Edit: I have tried this, but it does not work (it blocks all threads):

class Mutex
{
public:
    Mutex() : lockCounter(0) { }

    void lock()
    {
        if(lockCounter.fetch_add(1, std::memory_order_acquire)>0)
        {
            std::unique_lock<std::mutex> lock(internalMutex);
            cv.wait(lock);
        }
    }

    void unlock()
    {
        if(lockCounter.fetch_sub(1, std::memory_order_release)>1)
        {
            cv.notify_one();
        }
    }


private:
    std::atomic<int> lockCounter;
    std::mutex internalMutex;
    std::condition_variable cv;
};

Thanks!

Edit: Final solution

MikeMB's fast mutex was working pretty well.

As a final solution, I did:

  • Use a simple spinlock with a try_lock
  • When a thread fails to try_lock, instead of waiting it fills a per-thread queue (not shared with other threads) and continues
  • When a thread gets a lock, it updates the buffer with the current result, but also with the results stored in its queue (it processes its queue)
  • Buffer flushing was made much more efficient: the blocking part only swaps two pointers.
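The combination of those bullets can be sketched roughly as follows. This is my own illustration of the scheme, not the actual code; all class and member names are invented:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// try_lock-capable spinlock (one atomic flag, no system calls).
class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    bool try_lock() { return !flag.test_and_set(std::memory_order_acquire); }
    void lock()     { while (flag.test_and_set(std::memory_order_acquire)) std::this_thread::yield(); }
    void unlock()   { flag.clear(std::memory_order_release); }
};

// Shared accumulation buffer guarded by the spinlock.
struct SharedBuffer {
    std::vector<float> data;
    SpinLock lock;
    explicit SharedBuffer(std::size_t n) : data(n, 0.0f) {}
};

struct Worker {
    // Thread-local queue of (index, value) pairs that could not be
    // flushed yet because the buffer was busy.
    std::vector<std::pair<std::size_t, float>> pending;

    void addResult(SharedBuffer& buf, std::size_t idx, float value) {
        pending.emplace_back(idx, value);
        // Never block: if the buffer is busy, keep accumulating locally
        // and retry on the next result.
        if (buf.lock.try_lock()) {
            for (auto& p : pending) buf.data[p.first] += p.second;
            pending.clear();
            buf.lock.unlock();
        }
    }
};
```

The key property is that a worker never waits: contention merely delays when its queued results reach the shared buffer.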
    Hmm.. best locking is no locking. May I ask, what is struct of 'buffer', how large is each 'result' and what is done with buffer upon completion? Suspect XY..? – Martin James Mar 22 '15 at 10:55
  • 1
    Did you try comparing the performance against a raw pthread mutex? – tohava Mar 22 '15 at 11:42
  • Did you try a spinlock? – MikeMB Mar 22 '15 at 11:59
  • @MartinJames: Buffer is an array of floats and values are accumulated (added). Size is too large to have one buffer per thread. If there were atomic floats I would be happy :) – galinette Mar 22 '15 at 12:13
  • @tohava: the std::thread mutex is a pthread wrapper, so yes – galinette Mar 22 '15 at 12:14
  • @MikeMB: Is it recommended? If there is only one core, this will cause problems AFAIK. I really cannot afford a potential bug here – galinette Mar 22 '15 at 12:16
  • I don't see how a spinlock (built on top of atomics, of course) would introduce a (functional) bug here (like a race condition). Depending on your application's access profile, it may worsen your performance and produce some jitter, but it may be a reasonable solution for your particular scenario. – MikeMB Mar 22 '15 at 12:25
  • Link: [Roll Your Own Lightweight Mutex](http://preshing.com/20120226/roll-your-own-lightweight-mutex/). But: are you sure, you want to program low-level concurrency, as compared to scalable abstractions? i.e. [tbb](https://www.threadingbuildingblocks.org/) – Dmitry Ledentsov Mar 22 '15 at 13:42
  • @Dmitry: I have found this article already, but there is no simple equivalent of the Win32 semaphore in the STL. I would be interested in knowing what the STL equivalent is – galinette Mar 22 '15 at 14:26
  • 1
    @galinette: Unfortunately, there is none - you can only build one ontop of condition variables, atomics and mutexes (which - I think - is a serious shortcomming) – MikeMB Mar 22 '15 at 14:38
  • To deal with the flushing thread, why not use two buffers? Lock the mutex only to switch a pointer (or index), and flush the buffer at leisure while the writing threads use the other buffer. – arayq2 Mar 22 '15 at 16:46
  • Your second example should be working correctly... and should indeed be exactly what you want... what do you mean by "locks all threads"? – sbabbi Mar 22 '15 at 18:00
  • @sbabbi: they all get stuck somewhere. I think they get stuck in cv.wait() but I did not check this. – galinette Mar 22 '15 at 19:00
  • This doesn't answer your question about an efficient standard implementation of a mutex, but I would suggest to upload the structure of your program here or at stackexchange and look for structural improvements. E.g. flushing the whole buffer while holding a mutex should not be necessary. – MikeMB Mar 22 '15 at 19:58
  • The code from your second edit will run into the following problem: Thread A calls `lock`, increments `lockCounter` from 0 to 1, returns and then does something. Thread B calls `lock` and increments `lockCounter` from 1 to 2. Now, thread A calls `unlock`, decrements `lockCounter` from 2 to 1, calls `notify_one` and returns. Thread B locks the mutex and enters `cv.wait()` which will block it indefinitely. A `condition_variable` is not a semaphore, you can only release threads that are *already waiting*. Also, wakeups from `cv.wait` can occur *spuriously*. – dyp Mar 22 '15 at 20:59
  • Take a look at [this](http://stackoverflow.com/a/4793662/666785) answer to see how to simulate a semaphore using mutex+cv. Basically you need another (non atomic) counter. – sbabbi Mar 22 '15 at 23:23
  • @sbabbi: The problem is that this results in two unconditional calls to `mutex.lock()` and `mutex.unlock()` each. I don't see how that should improve efficiency over simply locking and unlocking a single mutex once (wouldn't hurt to test though). – MikeMB Mar 22 '15 at 23:45
  • Have you examined the Qt source code to see how QMutex works? That would probably give you some ideas... – Jeremy Friesner Mar 23 '15 at 05:01
  • *Size is too large to have one buffer per thread* Really? Do the threads update **all** buffer data whenever they do write to the buffer? – Walter Mar 23 '15 at 11:15
  • @Walter: no, they update only a small part of a large buffer, but that part is random. To reduce mutex collisions the buffer is already subdivided into chunks which have their own mutexes. This is why the collision rate is low; otherwise it would be very high. – galinette Mar 23 '15 at 14:59
  • @Siyuan Ren: for development, yes, but the code will be ported to cluster architectures, so the Win32 API is not usable. – galinette Mar 23 '15 at 15:00
  • @JeremyFriesner: I did, but I still don't understand well how it can work, since I had concluded that it doesn't work while it does :) – galinette Mar 23 '15 at 15:01
  • @galinette: MSVC has a terrible implementation of `std::mutex`. Perhaps if you profile on the target platform you will find no difference between `std::mutex` and `pthread_mutex`. – Siyuan Ren Mar 24 '15 at 13:56

2 Answers


General Advice

As mentioned in some of the comments, I'd first have a look at whether you can restructure your program design to make the mutex implementation less critical for your performance.
Also, since multithreading support in standard C++ is pretty new and still somewhat immature, you sometimes just have to fall back on platform-specific mechanisms, e.g. a futex on Linux systems, critical sections on Windows, or non-standard libraries like Qt.
That being said, I can think of two implementation approaches that might speed up your program:

Spinlock
If access collisions happen very rarely, and the mutex is only held for short periods of time (two things one should strive for anyway, of course), it might be most efficient to just use a spinlock, as it doesn't require any system calls at all and is simple to implement (adapted from cppreference):

class SpinLock {
    std::atomic_flag locked = ATOMIC_FLAG_INIT; // explicit init: the default-constructed state is unspecified
public:
    void lock() {
        while (locked.test_and_set(std::memory_order_acquire)) { 
             std::this_thread::yield(); //<- this is not in the source but might improve performance. 
        }
    }
    void unlock() {
        locked.clear(std::memory_order_release);
    }
};

The drawback, of course, is that waiting threads don't stay asleep but instead steal processing time.

Checked Locking

This is essentially the idea you demonstrated: first make a fast check of whether locking is actually needed, based on an atomic exchange operation, and use a heavyweight std::mutex only if it is unavoidable.

struct FastMux {
    //Status of the fast mutex
    std::atomic<bool> locked;
    //helper mutex and cv on which threads can wait in case of collision
    std::mutex mux;
    std::condition_variable cv;
    //the maximum number of threads that might be waiting on the cv (conservative estimation)
    std::atomic<int> cntr; 

    FastMux():locked(false), cntr(0){}

    void lock() {
        if (locked.exchange(true)) {
            cntr++;
            {
                std::unique_lock<std::mutex> ul(mux);
                cv.wait(ul, [&]{return !locked.exchange(true); });
            }
            cntr--;
        }
    }
    void unlock() {
        locked = false;
        if (cntr > 0){
            std::lock_guard<std::mutex> ul(mux);
            cv.notify_one();
        }
    }
};

Note that the std::mutex is not held between lock() and unlock(); it is only used for handling the condition variable. This results in more lock/unlock calls if there is high contention on the mutex.

The problem with your implementation is that cv.notify_one(); can potentially be called between if(lockCounter.fetch_add(1, std::memory_order_acquire)>0) and cv.wait(lock);, so your thread might never wake up.

I didn't do any performance comparisons against a fixed version of your proposed implementation, though, so you'll just have to see what works best for you.
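For reference, one way to sanity-check the checked-locking class is to guard a shared counter from several threads. std::lock_guard works directly because the class provides lock()/unlock() (it satisfies BasicLockable). This harness is my own and repeats the class so the snippet compiles on its own:

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// FastMux exactly as in the answer above.
struct FastMux {
    std::atomic<bool> locked;
    std::mutex mux;
    std::condition_variable cv;
    std::atomic<int> cntr;
    FastMux() : locked(false), cntr(0) {}
    void lock() {
        if (locked.exchange(true)) {
            cntr++;
            {
                std::unique_lock<std::mutex> ul(mux);
                cv.wait(ul, [&]{ return !locked.exchange(true); });
            }
            cntr--;
        }
    }
    void unlock() {
        locked = false;
        if (cntr > 0) {
            std::lock_guard<std::mutex> ul(mux);
            cv.notify_one();
        }
    }
};

FastMux fm;
long counter = 0; // protected by fm

// Spawn nThreads threads, each incrementing the counter iters times
// under the fast mutex; returns the final counter value.
long runThreads(int nThreads, int iters) {
    counter = 0;
    std::vector<std::thread> ts;
    for (int t = 0; t < nThreads; ++t) {
        ts.emplace_back([=] {
            for (int i = 0; i < iters; ++i) {
                std::lock_guard<FastMux> g(fm); // FastMux is BasicLockable
                ++counter;
            }
        });
    }
    for (auto& t : ts) t.join();
    return counter;
}
```

If the lock were broken, lost increments would show up as a final count below nThreads * iters.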

  • I have tried the spinlock. It causes another issue, because the problem is slightly more complex: periodically, an extra thread locks the mutex and flushes the buffer, which takes some time. This causes all the worker threads to enter the spinlock, and they do not yield to the scheduler, so the flush is slow. – galinette Mar 22 '15 at 14:35
  • @MikeMB Regarding the edits, cv.wait/cv.notify are designed *exactly* to prevent this. Cv.notify will increment some kind of counter and cv.wait will check that out before sitting in waiting... otherwise condition variables would be almost completely useless. – sbabbi Mar 22 '15 at 18:23
  • @sbabbi: Do you have a source for that? I think you confuse semaphores and condition variables. – MikeMB Mar 22 '15 at 19:13
  • @MikeMB: thanks for the answer, it works. The performance is not as good as Qt's mutexes, maybe because one collision potentially generates many locks. I finally implemented spinlocks, reduced the flushing time to a minimum by swapping buffers, and, as a final optimization, instead of locking the spinlock, the threads do a try_lock; if it fails they fall back to a per-thread queue, which they empty later when they successfully get a lock. – galinette Mar 23 '15 at 15:06

Not really an answer per the definition, but depending on the specific task, a lock-free queue might help to get rid of the mutex altogether. This would help the design if you have multiple producers and a single consumer (or even multiple consumers). Links:

Update wrt to comments:

Queue size / overflow:

  • Queue overflow can be avoided by (i) making the queue large enough or (ii) making the producer thread wait to push data once the queue is full.
  • Another option would be to use multiple consumers and multiple queues and implement a parallel reduction, but this depends on how the data is treated.

Consumer thread:

  • The queue could use std::condition_variable and make the consumer thread wait until there is data.
  • Another option would be to use a timer to check at regular intervals (polling) whether the queue is non-empty; once it is non-empty, the thread can continuously fetch data and then go back into wait mode.
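The condition-variable variant from the first bullet can be sketched as below. This version is mutex-based for brevity (a true lock-free queue would replace the internals), and all names are illustrative:

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <utility>

// Minimal bounded queue: producers block when the queue is full
// (overflow handling), the consumer sleeps until data arrives
// (no busy polling). Illustrative sketch, not production code.
template <typename T>
class BoundedQueue {
    std::deque<T> q;
    std::mutex m;
    std::condition_variable notFull, notEmpty;
    const std::size_t cap;
public:
    explicit BoundedQueue(std::size_t capacity) : cap(capacity) {}

    void push(T v) {
        std::unique_lock<std::mutex> lk(m);
        notFull.wait(lk, [&]{ return q.size() < cap; }); // producer waits on overflow
        q.push_back(std::move(v));
        notEmpty.notify_one();
    }

    T pop() {
        std::unique_lock<std::mutex> lk(m);
        notEmpty.wait(lk, [&]{ return !q.empty(); });    // consumer waits until data
        T v = std::move(q.front());
        q.pop_front();
        notFull.notify_one();
        return v;
    }
};
```

The predicate-taking wait overload handles spurious wakeups, which is exactly the pitfall dyp pointed out in the comments above.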
  • I have considered this approach. I did not try it yet, because I am concerned by the queue overflow issues. Also, how to avoid a spinlock in the consumer thread for waiting some incoming data? This would add a nearly useless thread working at full speed. – galinette Mar 22 '15 at 14:43
  • This would imply calling std::condition_variable::notify_one very often. This would be OK if the overhead is acceptable, to be tested. Polling may lead to overflow issues, as the code has to run at very different production rates and on a potentially large number of cores. – galinette Mar 22 '15 at 15:14
  • Probably it's the best to use some sort of hierarchy. Then overflowing could be avoided since the load is distributed. On the other hand depending on the data types, it might not hurt to simply make the queue large enough and use the `full` flag on producer side. However, this would also require to ensure that the queue is emptied by a certain amount once it stalls to avoid repeated blocking. – Sebastian Mar 22 '15 at 15:20
  • I will think about this solution. I prefer not to "consume" in a dedicated thread, as in some conditions it may become a bottleneck (if too many threads are producing too fast). So the threads should fill a queue but sometimes take the initiative of consuming it into the buffer. This may be an elegant but somewhat complicated solution. – galinette Mar 22 '15 at 19:30