1

I have written a small async_task class that keeps a thread warm so that I can perform a background calculation. The task could be triggered by multiple threads, but only one instance of the task should be running at any one time. On one of my CI servers (a very old and slow mac mini circa 2011 - intel penryn processor) my unit test sometimes fails with a SIGABRT (compiling with Clang 9.0 - not AppleClang - on macOS 10.13). It's never failed on the Windows 10 build machine - intel i9 processor.

Here is a minimum representation of the code under test and the unit test extracted into a stand alone C++ application:

#include <thread>
#include <condition_variable>
#include <mutex>
#include <atomic>
#include <functional>
#include <iostream>

// class under test...

class async_task final
{
    std::thread thread_;
    std::condition_variable run_request_;
    std::mutex run_request_mutex_;
    std::atomic<bool> quit_{ false };
    std::atomic<bool> run_{ false };

public:
    async_task(std::function<void()> task) :
        thread_{ [this, task] { thread_proc(task); }}
    {
    }

    ~async_task()
    {
        {
            std::unique_lock<std::mutex> lock(run_request_mutex_);

            quit_ = true;
        }

        run_request_.notify_one();

        thread_.join();
    }

    void run()
    {
        {
            std::unique_lock<std::mutex> lock(run_request_mutex_);

            run_ = true;
        }

        run_request_.notify_one();
    }

private:
    void thread_proc(std::function<void()> task)
    {
        while (!quit_)
        {
            {
                std::unique_lock<std::mutex> lock{ run_request_mutex_ };

                run_request_.wait(lock, [this] { return quit_ || run_; });
            }

            bool run = false;

            if (run_.exchange(false))
            {
                run = !quit_;
            }

            if (run)
            {
                task();
            }
        }
    }
};


// exercising code...

int main()
{
    std::condition_variable condition;
    std::mutex mutex;

    std::atomic<bool> value = false;

    async_task task{ [&value, &mutex, &condition]()
    {
        {
            std::unique_lock<std::mutex> lock(mutex);

            value = true;
        }

        condition.notify_one();
    } };

    task.run();

    {
        using namespace std::chrono_literals;

        std::unique_lock<std::mutex> lock(mutex);

        if (!value)
        {
            condition.wait_for(lock, 5s, [&value] { return value.load(); });
        }
    }

    return EXIT_SUCCESS;
}

There must be a race condition, but I can't for the life of me spot what might cause a SIGABRT. Can anyone spot the problem?

UPDATE: added mutex to destructor to protect quit_ as this has been pointed out as a secondary problem - although not the cause of the issue in question.

keith
  • 5,122
  • 3
  • 21
  • 50
  • 5
    Member `thread_` should be defined as the last one, now your thread starts accessing other members - which could not be created yet. – rafix07 Feb 29 '20 at 17:39
  • 1
    Does this answer your question? [Why do I need to acquire a lock to modify a shared "atomic" variable before notifying condition\_variable](https://stackoverflow.com/questions/41867228/why-do-i-need-to-acquire-a-lock-to-modify-a-shared-atomic-variable-before-noti) – walnut Feb 29 '20 at 17:42
  • Your stop mechanism leads to deadlocks, see https://stackoverflow.com/a/60072132/412080 – Maxim Egorushkin Feb 29 '20 at 18:13
  • @rafix07, good spot, I suspect you're right. – keith Feb 29 '20 at 19:12
  • @Maxim Egorushkin, you have referenced an answer which was not marked as correct, and the answer marked correct contradicts. – keith Feb 29 '20 at 19:18
  • @keith There are multiple errors in that question, my answer deals with one of those, there is no contradiction. Here is an accepted answer for you: https://stackoverflow.com/a/55383456/412080 But I expected you to understand the problem in your code, rather than dismissing it on only on the basis of acceptance. It is a common error, people ask it multiple times a week. – Maxim Egorushkin Feb 29 '20 at 19:23
  • @walnut, I do use the pattern referenced for the `run_` variable. I will add this pattern for the quit variable also. However I would argue that this issue wouldn't cause a SIGABRT, it would cause a hang which would be reported in my CI as an aborted build - which it isn't when the error happens (the test always takes less than 5 ms to run). – keith Feb 29 '20 at 19:23
  • @Maxim Egorushkin, I disagree: the mutex is to stop the race condition when entering the `wait`. I don't believe modifying `run_` in the same thread as the `wait` needs to be protected by the mutex, but it needs to be atomic so that it's fenced for calls to `run` outside the wait mutex. – keith Feb 29 '20 at 19:28
  • @keith The condition notification gets lost if you don't lock the mutex when updating `quit_`, see that table. Which leads to a deadlock. Yes, it is subtle, you have to sit down and think about it. – Maxim Egorushkin Feb 29 '20 at 19:32
  • 1
    @keith: Maxim’s point is correct: there is a potential data race with the termination code above. The problem is that the check for `quit` and going to sleep are separate operations in the processing thread: nothing guarantees that between reading `quit` and calling `wait()` the other thread doesn’t set `quit` to `true` and notifies the condition variable. Thus, the notification may be lost and the `wait()`ing thread never wakes up. That problem does not exist when a mutex is used to protect the accesses to `quit`. – Dietmar Kühl Feb 29 '20 at 19:33
  • 1
    @Maxim Egorushkin, agreed on putting a mutex around `quit_` in the destructor. – keith Feb 29 '20 at 19:34
  • @keith And that makes `atomic<>` unnecessary. – Maxim Egorushkin Feb 29 '20 at 19:41
  • @Maxim Egorushkin, but both `quit_` and `run_` are modified in `thread_proc` outside the mutex so they must remain atomic. – keith Feb 29 '20 at 19:47

1 Answers1

1

The obvious race is in the order of member initialization: the std::thread kicks off immediately and the spawned thread may access the mutex and the condition variable before it is actually constructed. Making the std::thread the last member of your class should fix that.

Dietmar Kühl
  • 150,225
  • 13
  • 225
  • 380
  • The logic does have the potential to dead lock as long as the check of `quit_` and `wait()` are not made an atomic unit using a `std::mutex`: otherwise there is a potential of a lost notification. I think making the `wait()` a timed `wait()` should also work as the lost notification kind of doesn’t matter: eventually the condition will be rechecked. This approach does feel impure, though. – Dietmar Kühl Feb 29 '20 at 19:58
  • ... and, yes, I missed to use of `quit_` to meddle with `run_` : fir that `quit_` might need to remain `std::atomic`. It may be preferable to read just once and update `run_` from that, though. – Dietmar Kühl Feb 29 '20 at 20:01
  • I've added the mutex around `quit_` in the question just to get that one off the table. I don't see how there is a deadlock now in the way the rest of `thread_proc` is written? I agree that the `thread_` needs to move to the bottom of the list also. A missed notification is permissble. – keith Feb 29 '20 at 20:09
  • Ah, I see: the condition passed to the `wait()` gets checked before actually waiting. As that is done while the mutex is held, that use is actually save. Somehow I didn’t quite recognize that. – Dietmar Kühl Feb 29 '20 at 20:21
  • What nobody has pointed out though is that `atomic value` in `main` doesn't need to be atomic :-) – keith Feb 29 '20 at 20:31
  • Could you outline the potential for the lost notification? I'm revisiting this before it goes into production, and I can't see it. If `quit_` is `true` then it can loose the notification. I've moved `std::thread` to be the last member. – keith Apr 18 '20 at 18:52