36

My recent attempt to implement a thread/mutex manager ended up with 75% CPU load on a 4-core machine, while all four running threads were either sleeping or waiting for a mutex to be unlocked.

The specific class is far too large to post here in its entirety, but I could narrow the cause down to the deadlock-safe acquisition of two mutexes:

std::unique_lock<std::mutex> lock1( mutex1, std::defer_lock );
std::unique_lock<std::mutex> lock2( mutex2, std::defer_lock );
std::lock( lock1, lock2 );

Another part of the class uses a std::condition_variable with wait() and notify_one() on mutex1 so that some code is executed only under certain conditions.

The simple change to

std::unique_lock<std::mutex> lock1( mutex1 );
std::unique_lock<std::mutex> lock2( mutex2 );

brought the CPU usage down to a normal 1-2%.

I can't believe that the std::lock() function is that inefficient. Could this be a bug in g++ 4.6.3?

Edit (example):

#include <atomic>
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex mutex1, mutex2;
std::condition_variable cond_var;

bool cond = false;
std::atomic<bool> done{ false };

using namespace std::chrono_literals;

void Take_Locks()
    {
    while( !done )
        {
        std::this_thread::sleep_for( 1s );

        std::unique_lock<std::mutex> lock1( mutex1, std::defer_lock );
        std::unique_lock<std::mutex> lock2( mutex2, std::defer_lock );
        std::lock( lock1, lock2 );

        std::this_thread::sleep_for( 1s );
        lock1.unlock();
        lock2.unlock();
        }
    }

void Conditional_Code()
    {
    std::unique_lock<std::mutex> lock1( mutex1, std::defer_lock );
    std::unique_lock<std::mutex> lock2( mutex2, std::defer_lock );

    std::lock( lock1, lock2 );
    std::cout << "t4: waiting \n";

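    // note: wait() releases only lock1; mutex2 stays locked for the entire wait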
    while( !cond )
        cond_var.wait( lock1 );

    std::cout << "t4: condition met \n";
    }

int main()
    {
    std::thread t1( Take_Locks ), t2( Take_Locks ), t3( Take_Locks );
    std::thread t4( Conditional_Code );

    std::cout << "threads started \n";
    std::this_thread::sleep_for( 10s );

    std::unique_lock<std::mutex> lock1( mutex1 );
    std::cout << "mutex1 locked \n" ;
    std::this_thread::sleep_for( 5s );

    std::cout << "setting condition/notify \n";
    cond = true;
    cond_var.notify_one();
    std::this_thread::sleep_for( 5s );

    lock1.unlock();
    std::cout << "mutex1 unlocked \n";
    std::this_thread::sleep_for( 6s );

    done = true;
    t4.join(); t3.join(); t2.join(); t1.join();
    }
Arnaud
mac
  • You should post the code that contains the calls to `wait()` and `notify_one()` – Ben Dec 02 '12 at 09:11
  • Complex locking schemes for multiple resource management that 'rely' on continually polling locks are hopeless, wasting CPU and/or wasting memory-bandwidth and/or introducing avoidable latency. If you are polling locks, you ARE doing it wrong. – Martin James May 10 '14 at 15:55

4 Answers

30

On my machine, the following code prints a line 10 times a second and consumes almost no CPU, because most of the time each thread is either sleeping or blocked on a locked mutex:

#include <chrono>
#include <thread>
#include <mutex>
#include <iostream>

using namespace std::chrono_literals;

std::mutex m1;
std::mutex m2;

void
f1()
{
    while (true)
    {
        std::unique_lock<std::mutex> l1(m1, std::defer_lock);
        std::unique_lock<std::mutex> l2(m2, std::defer_lock);
        std::lock(l1, l2);
        std::cout << "f1 has the two locks\n";
        std::this_thread::sleep_for(100ms);
    }
}

void
f2()
{
    while (true)
    {
        std::unique_lock<std::mutex> l2(m2, std::defer_lock);
        std::unique_lock<std::mutex> l1(m1, std::defer_lock);
        std::lock(l2, l1);
        std::cout << "f2 has the two locks\n";
        std::this_thread::sleep_for(100ms);
    }
}

int main()
{
    std::thread t1(f1);
    std::thread t2(f2);
    t1.join();
    t2.join();
}

Sample output:

f1 has the two locks
f2 has the two locks
f1 has the two locks
...

I'm running this on OS X, and the Activity Monitor application says this process is using 0.1% CPU. The machine is an Intel Core i5 (4 cores).

I'm happy to adjust this experiment in any way to attempt to create live-lock or excessive cpu usage.

Update

If this program is using excessive CPU on your platform, try changing it to call ::lock() instead, defined as:

template <class L0, class L1>
void
lock(L0& l0, L1& l1)
{
    while (true)
    {
        {
            std::unique_lock<L0> u0(l0);
            if (l1.try_lock())
            {
                u0.release();
                break;
            }
        }
        std::this_thread::yield();
        {
            std::unique_lock<L1> u1(l1);
            if (l0.try_lock())
            {
                u1.release();
                break;
            }
        }
        std::this_thread::yield();
    }
}
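
For example, in f1 above the call site would change from std::lock(l1, l2) to ::lock(l1, l2); the deferred std::unique_locks satisfy the Lockable requirements this template needs.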

I would be interested to know if that made any difference for you, thanks.

Update 2

After a long delay, I have written a first draft of a paper on this subject. The paper compares 4 different ways of getting this job done. It contains software you can copy and paste into your own code and test yourself (and please report back what you find!):

http://howardhinnant.github.io/dining_philosophers.html

Howard Hinnant
  • On my machine, this example mostly prints a series of f1/f2 "...has the two locks" messages in no alternating order. (gcc 4.6.3) – mac Feb 03 '13 at 13:15
  • @mac the behavior may differ between operating systems or c++ implementations. I see the behavior you describe with libstdc++ on linux but I see alternating f1/f2 with libc++ on OS X. – bames53 Feb 05 '13 at 13:35
  • On OS X I see mostly alternating f1/f2, but not strictly. Every once in a while it breaks that pattern. More often if you shorten the sleep period. At any rate, the exact sequence of f1/f2 lock acquires is not really a pertinent point here. The main point is to dispute David Schwartz's assertion that `std::lock` cannot be efficiently implemented. I.e. I strongly dispute: "There's no way the function can wait without spinning." And offer this experiment as evidence to the contrary, and invite anyone to post code that does spin. – Howard Hinnant Feb 06 '13 at 00:02
  • @bames53 interesting, OS X seems to have more 'advanced' implementations. – mac Feb 15 '13 at 20:04
  • I notice that when I run this program under VS2012 it uses a lot of CPU; 25% on a 4-core machine. If I change it to not use deferred locks (and acquire the locks in a consistent order) then CPU usage drops to basically zero. I tried to read the implementation of `std::lock` to find out why, but it was too ugly. @mac – bames53 Feb 15 '13 at 20:21
  • I definitely get the code to spin on Linux by locking a mutex in `main()`. But keeping it with just two threads, I have made the following changes: 1) add a short sleep before `lock()` in `t1()` and `t2()` for alternating order (see my first comment). 2) lock just one mutex in one thread and both in the other. 3) Because of the low refresh rate of my CPU monitor I increased the sleep duration after locking the single mutex to 500 ms. Now the code produces peaks of up to 30% CPU load. – mac Feb 15 '13 at 23:01
  • @bames53 I use `std::thread` to keep my code as portable as possible. After your experience with VS2012 I am even happier with the decision to rewrite my code and avoid deferred locks. – mac Feb 15 '13 at 23:25
  • Using the updated `::lock()` function brought the CPU usage down to acceptable values. – mac Feb 16 '13 at 21:18
  • I know, basically it is the implementation from the link you posted in your very first comment. But my lack of understanding prevented me from testing it. – mac Feb 16 '13 at 21:56
  • Consider this code donated to the public domain. Performance bug reports may be in order for non-libc++ implementations. – Howard Hinnant Feb 16 '13 at 21:58
  • As written this code releases one of the locks before returning. Shouldn't all of them be held when the call returns? – Nemo Aug 29 '13 at 22:24
  • @Nemo Actually, std::unique_lock::release() breaks the association with the mutex without unlocking it. (Otherwise, the unique_lock would go out of scope with the mutex still associated, and its dtor would unlock the mutex.) – Nicu Stiurca Aug 29 '13 at 22:58
  • @HowardHinnant: OK got it. Still not sure how well this generalizes, but for two mutexes it is a good implementation. (Of a bad spec.) – Nemo Aug 29 '13 at 23:45
  • @Nemo: If you want to see how this is efficiently implemented for N mutexes (N > 2), see the open source code libc++ at (http://libcxx.llvm.org). The spec is efficiently implementable without heroic effort. It has been performance tested for up to N == 8 (with good results). If you need to simultaneously lock more mutexes than that, I would be interested to learn about your use case. – Howard Hinnant Dec 17 '13 at 01:26
  • Small question: why do you release the first lock after obtaining the second. Doesn't dining philosopher require both forks to eat? – StackedCrooked May 11 '14 at 17:17
  • @StackedCrooked: The `unique_lock::release()` function releases ownership of the lock. It doesn't unlock the mutex. So the local `u0` is responsible for locking `l0`, and *if* `l1.try_lock()` throws or returns false, `u0` is responsible for unlocking `l0`. If `l1.try_lock()` succeeds, now both `l0` and `l1` are locked. Mission accomplished. The only thing left to do is to tell `u0` it is no longer responsible for `l0`. The calling client is now responsible for both `l0` and `l1`. – Howard Hinnant May 11 '14 at 19:10
  • @HowardHinnant Oh, right. Silly mistake of me. Thanks. – StackedCrooked May 11 '14 at 20:34
  • @HowardHinnant On my Ubuntu 14.04 64-bit machine with gcc 4.9.2 I get a lot of f2 lines (only f2) at the beginning, and then after a few minutes only f1. There is no alternation whatsoever. – Patryk Nov 23 '14 at 18:06
  • @Patryk: Is that using the implementation of `lock` shown in my answer, or the gcc `std::lock`? If the latter, try the former. – Howard Hinnant Nov 23 '14 at 20:35
  • @HowardHinnant Using the `lock` version from this answer the result is basically the same (some other random order, but mostly a lot of `2`s and a lot of `1`s with very infrequent changes). – Patryk Nov 23 '14 at 22:32
  • @Patryk: Is the behavior the same if you have `f1` and `f2` lock only `m1`? – Howard Hinnant Nov 24 '14 at 00:37
  • @HowardHinnant Can you post the code somewhere? I cannot imagine how you would like it to look like. – Patryk Nov 24 '14 at 17:07
  • @Patryk: http://codepad.org/igAI2mSb The idea is to see if the behavior that you are seeing is from the algorithm that is trying to lock two mutexes, or coming from the thread scheduler, which is not under the control of this experiment. – Howard Hinnant Nov 24 '14 at 17:12
  • @HowardHinnant The behavior with this code is the same as I have described above. – Patryk Nov 24 '14 at 21:25
  • @Patryk: Thanks for running that test. That seems to me to indicate that there is nothing any two-lock-locking-algorithm could do on your system to change the behavior you are seeing. – Howard Hinnant Nov 25 '14 at 01:41
7

As the documentation says, "[t]he objects are locked by an unspecified series of calls to lock, try_lock, unlock." There is simply no way that can be efficient if the mutexes are held by other threads for a significant period of time. There's no way the function can wait without spinning.
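
To illustrate the claim, here is a hypothetical sketch of the kind of try-and-back-off loop the wording permits (the name naive_lock is made up; this is not any particular library's implementation). If another thread holds the second mutex for a long time, each failed try_lock sends this loop around again:

template <class M0, class M1>
void naive_lock(M0& m0, M1& m1)   // hypothetical, for illustration only
{
    while (true)
    {
        m0.lock();                // block until the first mutex is free
        if (m1.try_lock())        // poll the second
            return;               // got both: done
        m0.unlock();              // failed: back off and go around again
    }
}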

David Schwartz
  • @David Schwartz: The spin can be very slow and non-cpu-intensive if after a failure there is both a yield and an attempt to lock/block on the mutex that previously failed the `try_lock`. Here is an efficient open source implementation: http://llvm.org/svn/llvm-project/libcxx/trunk/include/mutex . This implementation has been tested in high contention scenarios and been found to be very efficient. The premise of your answer is just false. The OS will park thread A if thread B holds any of the locks thread A needs, in the exact same manner as if we were only talking about a single mutex. – Howard Hinnant Jan 25 '13 at 04:23
  • @HowardHinnant No OS I know of will park a ready-to-run thread if it has an available core on which to run it. (I don't think you understand the constraints here. There is no way "an unspecified series of calls to lock, try_lock, unlock" can be efficient, and that's what's specifically required here.) – David Schwartz Jan 25 '13 at 05:03
  • All of the modern OS's I know of will park a thread if it is not ready to run because it is blocked on a locked mutex. That's what a decent implementation of this algorithm does. I gave you a link to the source code to review. – Howard Hinnant Jan 25 '13 at 14:54
  • I looked at that source code. It just repeats `lock, trylock, release, yield`. If there are more cores than ready-to-run threads, the `yield` is a no-op. That implementation is *awful*, and it's not the implementation's fault, the specification gives it no choice. Two threads running in lock step, one doing `lock(a,b)` and the other `lock(b,a)` can spin forever. – David Schwartz Jan 25 '13 at 15:04
  • Try to set it up. See if you can create the live-lock. – Howard Hinnant Jan 25 '13 at 15:18
  • @HowardHinnant: I can't get it to livelock with itself without cheating, but I can write reasonable code that, if running along with that function on the same two locks, will cause ridiculous CPU consumption, at least double what it would be if you just do `lock(a); lock(b);`. (Which, of course, you can't. Again, it's the specification that prevents an efficient implementation.) – David Schwartz Jan 25 '13 at 15:23
  • I'm not sure how to carry on this discussion without a format that allows showing each other code (which won't fit in comments). Suggestions? – Howard Hinnant Jan 25 '13 at 15:28
  • @DavidSchwartz I'd be interested in seeing that code and an explanation. – bames53 Jan 28 '13 at 01:08
  • @bames53: I decided to turn this into its own Q&A: http://stackoverflow.com/questions/18520983/. All of the answers so far disagree with me, but I agree with David: There is no reasonable way to implement this thing as specified. (Any form of busy-loop is "unreasonable" in my world...) – Nemo Aug 29 '13 at 22:29
  • @Nemo - correct. Polling for locks is hopeless. CPU waste, memory-bandwidth waste and latency. – Martin James May 10 '14 at 15:57
6

The std::lock() non-member function can cause live-lock or performance degradation; its only guarantee is "never dead-lock".

If you can determine a lock order (lock hierarchy) for the mutexes by design, it is preferable not to use the generic std::lock() but to lock each mutex in the predetermined order, as in the sketch below.
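
For example, a minimal sketch assuming the design fixes the order mutex1-before-mutex2 everywhere (the function name is made up for illustration):

#include <mutex>

std::mutex mutex1, mutex2;   // hierarchy: mutex1 is always acquired first

void ordered_locking()       // hypothetical example, not from the question's class
{
    std::lock_guard<std::mutex> lock1(mutex1);   // level 1
    std::lock_guard<std::mutex> lock2(mutex2);   // level 2: only while holding mutex1
    // ... work with both resources ...
}   // destructors unlock in reverse order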

Refer to Acquiring Multiple Locks Without Deadlock for more detail.

yohjp
  • That's essentially what I ended up with, until an efficient and deterministic (conditional) lock method emerges. – mac Feb 03 '13 at 12:30
1

First, I want to thank everyone for their answers.

While working on example code that reproduces the effect, I found the source of the trouble.

The conditional part locks both mutexes, but passes only one of them to std::condition_variable::wait(). So while the waiting thread releases mutex1 inside wait(), it keeps mutex2 locked for the entire wait, and the other threads' std::lock( lock1, lock2 ) calls keep failing against it. A sketch of the fix is shown below.
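
A minimal sketch of the fix, assuming the conditional code only needs mutex1 while waiting (mutex2, if needed at all, is taken once the condition holds):

void Conditional_Code()
    {
    // lock only the mutex the condition variable uses
    std::unique_lock<std::mutex> lock1( mutex1 );
    std::cout << "t4: waiting \n";

    while( !cond )
        cond_var.wait( lock1 );          // releases mutex1 while blocked

    // acquire mutex2 only after the condition is met
    std::unique_lock<std::mutex> lock2( mutex2 );
    std::cout << "t4: condition met \n";
    }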

But I still wonder what is going on behind the scenes that produces such a high CPU load.

mac
  • See my answer. The thread is constantly locking one mutex, failing to lock the other, releasing the first mutex, failing again to lock the second, and repeating. – David Schwartz Dec 03 '12 at 20:01