Destruction of condition variable randomly loses notification

Question

Given a condition_variable as a member of a class, my understanding is that:

The condition variable is destroyed after the class destructor completes.
Destruction of a condition variable does not need to wait for notifications to have been received.

In light of these expectations, my question is: why does the example code below randomly fail to notify a waiting thread?

#include <mutex>
#include <condition_variable>
#define NOTIFY_IN_DESTRUCTOR 

struct notify_on_delete {
    std::condition_variable cv;

    ~notify_on_delete() {
#ifdef NOTIFY_IN_DESTRUCTOR
        cv.notify_all();
#endif
    }
};

int main () {
    for (int trial = 0; trial < 10000; ++trial) {
        notify_on_delete* nod = new notify_on_delete();
        std::mutex flag;
        bool kill = false;

        std::thread run([nod, &flag, &kill] () {
            std::unique_lock<std::mutex> lock(flag);
            kill = true;
            nod->cv.wait(lock);
        });

        while(true) {
            std::unique_lock<std::mutex> lock(flag);
            if (!kill) continue;
#ifdef NOTIFY_IN_DESTRUCTOR
            delete nod;
#else
            nod->cv.notify_all();
#endif
            break;
        }
        run.join();
#ifndef NOTIFY_IN_DESTRUCTOR
        delete nod;
#endif
    }
    return 0;
}

In the code above, if NOTIFY_IN_DESTRUCTOR is not defined then the test will run to completion reliably. However, when NOTIFY_IN_DESTRUCTOR is defined the test will randomly hang (usually after a few thousand trials).

I am compiling using Apple Clang: Apple LLVM version 9.0.0 (clang-900.0.39.2) Target: x86_64-apple-darwin17.3.0 Thread model: posix C++14 specified, compiled with DEBUG flags set.

EDIT:

To clarify: this question is about the semantics of the specified behavior of instances of condition_variable. The second point above appears to be reenforced in the following quote:

Blockquote Requires: There shall be no thread blocked on *this. [ Note: That is, all threads shall have been notified; they may subsequently block on the lock specified in the wait. This relaxes the usual rules, which would have required all wait calls to happen before destruction. Only the notification to unblock the wait needs to happen before destruction. The user should take care to ensure that no threads wait on *this once the destructor has been started, especially when the waiting threads are calling the wait functions in a loop or using the overloads of wait, wait_for, or wait_until that take a predicate. — end note ]

The core semantic question seems to be what "blocked on" means. My present interpretation of the quote above would be that after the line

cv.notify_all(); // defined NOTIFY_IN_DESTRUCTOR

in ~notify_on_delete() the thread test is not "blocked on" nod - which is to say that I presently understand that after this call "the notification to unblock the wait" has occurred, so according to the quote the requirement has been met to proceed with the destruction of the condition_variable instance.

Can someone provide a clarification of "blocked on" or "notification to unblock" to the effect that in the code above, the call to notify_all() does not satisfy the requirements of ~condition_variable()?

You seem to have UB here. 1. Thread sets kill to true. 2. Main reads kill, deletes nod. 3. Thread dereferences nod. — n. m. could be an AI, Jan 07 '18 at 07:17
@n.m. so the problem is that nod is dereferenced *again* when the "run" thread wakes? That makes sense. However, I would have expected an error rather than hanging. Would you please make your comment a proper answer for review? — AngelGabriel, Jan 07 '18 at 08:48
Even the first time is problematic, but I don't know whether this is the root of the problem. Why would you expect an error message? Dereferencing a dangling pointer is not a diagnosable error. — n. m. could be an AI, Jan 07 '18 at 08:58
@n.m. would you please post an answer in which you articulate the order of operations leading to dereferencing a deleted variable? If the mistake is simply my flawed code I would be glad to know - I will accept your answer, and will edit my question to clarify. — AngelGabriel, Jan 07 '18 at 22:48
Following assessment by @mevets I have submitted a bug report to Apple. I will update this post if there is a response... — AngelGabriel, Jan 15 '18 at 01:16

nishantsingh · Answer 1 · 2018-01-04T11:28:18.440

9

When NOTIFY_IN_DESTRUCTOR is defined:
Calling notify_one()/notify_all() doesn't mean that the waiting thread is immediately woken up and the current thread will wait for the other thread. It just means that if the waiting thread wakes up at some point after the current thread has called notify, it should proceed. So in essence, you might be deleting the condition variable before the waiting thread wakes up (depending on how the threads are scheduled).

The explanation for why it hangs, even if the condition variable is deleted while the other thread is waiting on it lies on the fact the wait/notify operations are implemented using queues associated with the condition variables. These queues hold the threads waiting on the condition variables. Freeing the condition variable would mean getting rid of these thread queues.

edited Jan 04 '18 at 11:28

answered Jan 04 '18 at 11:18

nishantsingh

4,537
5
25
51

"the wait/notify operations are implemented using queues associated with the condition variables" - this makes sense. However, this seems to me to contradict the claim from en.cppreference.com (linked above) "It is only safe to invoke the destructor if all threads have been notified. *It is not required that they have exited their respective wait functions*: some threads may still be waiting to reacquire the associated lock, or may be waiting to be scheduled to run after reacquiring it." (emphasis mine) – AngelGabriel Jan 04 '18 at 11:48
Additionally, if the queues are associated with the condition_variable, how is the code succeeding the majority of the time? Is deallocated but fortuitously unmodified memory being used? – AngelGabriel Jan 04 '18 at 11:51
Our note came straight from [\[thread.condition.condvar\]/5](https://timsong-cpp.github.io/cppwp/thread.condition.condvar#5). – T.C. Jan 04 '18 at 12:11
@T.C. Maybe the issue comes down to semantics (and I'm just being dense)? If so, would you please clarify the specific meanings of "blocked," "notified/notification" and "wait calls" in the context of the code example above? Specifically the qualification that "only the notification to unblock the wait needs to happen before destruction" seems to me to be satisfied, since the spin-wait on the "kill" variable ensures that the the "run" thread must be waiting when "delete nod" is executed, and "delete nod" calls "cv.notify_all()" before cv is destroyed. – AngelGabriel Jan 05 '18 at 02:49
2

in simple sentence: `condition_variable` can't be destroyed until all threads complete using it, otherwise you have undefined behavior. Note number of available CPU cores may be smaller than number of treads waiting for condition to be meet. – Marek R Jan 07 '18 at 08:14
@MarekR your statement *is* very clear... however, it appears to me to be contradicted by quotes from http://en.cppreference.com "it is not required that they [other threads] have exited their respective wait functions" and by timsong-cpp.github.io "Only the notification to unblock the wait needs to happen before destruction". – AngelGabriel Jan 07 '18 at 08:38
this means that thread doesn't have to be waiting when condition is fulfilled, it can check condition later and it will not be blocked. It doesn't refer to variable lifetime and I'm pointing out that lifetime of variable must be longer than cases of usage of this value. – Marek R Jan 08 '18 at 10:44
@MarekR To make sure that I really am understanding your clarification I will try to restate it... in the quote from timing-cpp.github.io the term "the notification to unblock the wait" does NOT refer to the call cv.notify_all(). Rather, "notification to unblock the wait" means that the thread that was blocked on the condition variable no longer is. This clarifies the *meaning* of "notified" in a way that explains the observed undefined behavior. Please post your comment as an answer clarifying the meaning of "notified" and other terms so that I can accept it. – AngelGabriel Jan 10 '18 at 12:19
1

You can't use value after it was destroyed in any case in C++. It doesn't matter if it is a `std::condition_variable` or anything else. This is undefined behavior in in many cases will end with a crash. And `nod->cv.wait(lock);` is a case when you can use variable when it is getting destroyed. – Marek R Jan 10 '18 at 13:21
‘ ... - this makes sense. However, this seems to me to contradict the claim from en.cppreference.com ...’ I think that is a common enough observation to warrant an acronym. – mevets Jan 11 '18 at 15:00
I would guess this is not a common/well-known problem, simply because most people define all condition variables at global or file scope and not as a member of any class. – Erik Alapää Jan 14 '18 at 10:32
It is a well known problem in concurrent programming; and has been for at least 40 years. It is also readily avoided, for example reference counts. The root problem is that 'De Jure standard’ is a contradiction in terms, but does employ more programmers. – mevets Jan 14 '18 at 23:42

mevets · Accepted Answer · 2018-01-14T23:16:07.350

7

I am pretty sure your vendors implementation is broken. Your program looks almost OK from the perspective of obeying the contract with the cv/mutex classes. I couldn’t 100% verify, I am behind one version.

The notion of “blocking” is confusing in the condition_variable (CV) class because there are multiple things to be blocking on. The contract requires the implementation to be more complex than a veneer on pthread_cond* (for example). My reading of it indicates that a single CV would require at least 2 pthread_cond_t’s to implement.

The crux is the destructor having a definition while threads are waiting upon a CV; and its ruin is in a race between CV.wait and ~CV. The naive implementation simply has ~CV broadcast the condvar then eliminate it, and has CV.wait remember the lock in a local variable, so that when it awakens from the runtime notion of blocking it no longer has to reference the object. In that implementation, ~CV becomes a “fire and forget” mechanism.

Sadly, a racing CV.wait could meet the preconditions, yet not be finished interacting with the object yet, when ~CV sneaks in and destroys it. To resolve the race CV.wait and ~CV need to exclude each other, thus the CV requires at least a private mutex to resolve races.

We aren’t finished yet. There usually isn’t an underlying support [ eg. kernel ] for an operation like “wait on cv controlled by lock and release this other lock once I am blocked”. I think that even the posix folks found that too funny to require. Thus, burying a mutex in my CV isn’t enough, I actually require a mechanism that permits me to process events within it; thus a private condvar is required inside the implementation of CV. Obligatory David Parnas meme.

Almost OK, because as Marek R points out, you are relying on referencing a class after its destruction has begun; not the cv/mutex class, your notify_on_delete class. The conflict is a bit academic. I doubt clang would depend upon nod remaining valid after control had transferred to nod->cv.wait(); but the real customer of most compiler vendors are benchmarks, not programmers.

As as general note, multi-threaded programming is difficult, and having now peaked at the c++ threading model, it might be best to give it a decade or two to settle down. It’s contracts are astonishing. When I first looked at your program, I thought ‘duh, there is no way you can destroy a cv that can be accessed because RAII’. Silly me.

Pthreads is another awful API for threading. At least it doesn’t attempt over-reach, and is mature enough that robust test suites keep vendors in line.

edited Jan 14 '18 at 23:16

answered Jan 13 '18 at 17:39

mevets

10,070
1
21
33

*I am pretty sure your vendors implementation is broken.* Why? This sequence is legal: 1. Thread A enters desctructor. 2. Thread A calls `notify_all()`. 3. Thread A returns from destructor. 4. `notify_on_delete` structure is desroyed. 5. Thread B wakes up and finds the `std::condition_variable` no longer exists. End result: undefined behavior, especially given the posted code does not ensure ["that no threads attempt to wait on *this once the destructor has been started"](http://localdoc.scusa.lsu.edu/cppreference/en/cpp/thread/condition_variable_any/~condition_variable_any.html) – Andrew Henle Jan 13 '18 at 17:56
Because the condition_variable class specifically calls this out: [link](http://en.cppreference.com/w/cpp/thread/condition_variable/~condition_variable). It is not even stupid, yet here we are; that is to say the destructor for the CV has to clear out all waiters.... – mevets Jan 13 '18 at 18:30
1

You completely missed "It is only safe to invoke the destructor if all threads have been notified". It does **NOT** say "It is only safe to invoke the destructor after notify_all() has been called." Those are not the same thing. The posted code does not ensure every waiting thread **has actually been notified**. – Andrew Henle Jan 13 '18 at 18:47
next sentence: "It is not required that they have exited their respective wait functions: some threads may still be waiting to reacquire the associated lock, or may be waiting to be scheduled to run after reacquiring it.” -- the lock is re-acquired by cv.wait(), which is within the class. Thus, the destruction of this class *must* synchronize with the cv, ensuring that all invokers have in fact exited, not just potentially. It is brain damaged, so fits quite nicely, but that is what it states. – mevets Jan 13 '18 at 18:54
*"It is not required that they have exited their respective wait functions..."* That does not mean that they've "been notified" and actually reached the point of waiting on the lock. Yes, it's totally brain damaged. I'd argue the spec has a hole in it by requiring the waiting threads to actually "[be] notified" while allowing them to still be in `wait()` as I see no way to tell if that's true or not. Given that, I think offhand the only way to ensure it works is to verify each thread has actually returned from `wait()`. Yes, that's U-G-L-Y. – Andrew Henle Jan 13 '18 at 19:04
This is exactly the discussion I was hoping to elicit when I posted the question. Thank you both so much! – AngelGabriel Jan 14 '18 at 10:16
@Andrew Henle - I think you have clearly stated my own confusion regarding the specification. My hope was that I had misunderstood the specification... and my intent was to clear up that misunderstanding so as not to repeat it. Can you please elaborate on what constitutes having "actually been notified" as opposed to a call to notify_all(). Is the requirement that the thread wake? – AngelGabriel Jan 14 '18 at 10:34
@AndrewHenle You mentioned that "the posted code does not ensure 'that no threads attempt to wait on *this once the destructor has been started'". The execution of "nod->cv.wait(lock);" and "delete nod;" are mutually exclusive and ordered with "delete nod" happening second due to the locking of "std::mutex flag;" and the spin wait on "bool kill". – AngelGabriel Jan 14 '18 at 10:43
Aside: @mevets would you be so kind as to post a link to an explanation of how it is that I am relying on referencing nod after its destruction has begun? Does the process of *returning from* "nod->cv.wait(lock);" by definition constitutes a use of the variable "nod"? To be clear, I can *imagine* an implementation in which nod is referenced after the calling thread wakes... but I don't understand why such behavior must by definition occur. – AngelGabriel Jan 14 '18 at 10:57
(warning, opinion) Imagine your outer class has some monitoring facility, like a reference count. In this world, nod->cv.wait() becomes '(nod->refc++, temp=nod->cv.wait(), nod->refc--, temp)’. thus your destructor would be invoked before nod->refc-- occurred. The destructor ordering rules (embedded classes are destroyed first) would make sense of this, but as your problem has underlined the stated rules are not bound by making sense. – mevets Jan 14 '18 at 22:12
(ps) I think you are giving this contract too much credit ( which is why I am leaning the opposite way ). The oddly worded restrictions strike me as the vague awareness of a certain unease rather than a model or ideal. – mevets Jan 15 '18 at 00:14
Answer accepted & bounty awarded. The presently-more-upvoted answer by user3286661 does not address the *actual* question regarding the meaning of the seemingly ambiguous terms in the ~condition_variable specification, whereas this answer, and the ensuring discussion tackle the problem explicitly. – AngelGabriel Jan 15 '18 at 01:02
thanks, btw you should take a look at absl:. I just learned of it from another question, and it looks like they took the threading issue a bit more seriously. – mevets Jan 15 '18 at 01:09

Destruction of condition variable randomly loses notification

2 Answers2