C++20 thread possibly waiting on std::atomic forever

Question

Consider the following example code where thread A pushes functions on a queue and thread B executes those when popping from the queue:

std::atomic<uint32_t> itemCount;

//Executed by thread A
void run(std::function<void()> function) {

    if (queue.push(std::move(function))) {
        itemCount.fetch_add(1, std::memory_order_acq_rel);
        itemCount.notify_one();
    }

}


//Executed by thread B
void threadMain(){

    std::function<void()> function;

    while(true){

        if (queue.pop(function)) {

            itemCount.fetch_sub(1, std::memory_order_acq_rel);
            function();

        }else{
            itemCount.wait(0, std::memory_order_acquire);
        }

    }

}

where queue is a concurrent queue which has a push and a pop function, each returning a bool indicating whether the given operation was successful. So push returns false if it's full and pop returns false if it's empty.

Now I was wondering if the code is thread-safe under all circumstances. Let's suppose thread B's pop fails and is about to invoke std::atomic<T>::wait. At the same time, thread A pushes a new element while thread B checks the initial wait condition. Since itemCount hasn't changed yet, it fails.

Immediately after that, thread A increments the counter and tries to notify one waiting thread (although thread B doesn't wait internally yet). Thread B finally waits on the atomic, causing the thread to never wake up again due to the lost signal despite there being an element in the queue. That only stops as soon as a new element is being pushed on the queue, notifying B to continue execution.

I wasn't able to reproduce this situation manually since the timing is close to impossible to get right.

Is this a serious concern or impossible to happen? What (preferably atomic) alternatives do exist in order to account for such rare situations?

EDIT: Just to mention, the queue is not blocking and only utilizes atomic operations.

The reason I'm asking is I don't understand how it's possible to implement an atomic wait operation. Although the standard says the whole operation is atomic (consisting of a load + predicate check + wait), in the implementation I'm using std::atomic<T>::wait is implemented roughly as follows:

void wait(const _TVal _Expected, const memory_order _Order = memory_order_seq_cst) const noexcept {
    _Atomic_wait_direct(this, _Atomic_reinterpret_as<long>(_Expected), _Order);
}

where _Atomic_wait_direct is defined as

template <class _Ty, class _Value_type>
void _Atomic_wait_direct(
    const _Atomic_storage<_Ty>* const _This, _Value_type _Expected_bytes, const memory_order _Order) noexcept {
    const auto _Storage_ptr = _STD addressof(_This->_Storage);
    for (;;) {
        const _Value_type _Observed_bytes = _Atomic_reinterpret_as<_Value_type>(_This->load(_Order));
        if (_Expected_bytes != _Observed_bytes) {
            return;
        }

        __std_atomic_wait_direct(_Storage_ptr, &_Expected_bytes, sizeof(_Value_type), _Atomic_wait_no_timeout);
    }
}

We can clearly see that there is an atomic load with the specified memory order to check the state of the atomic itself. However, I don't see how the whole operation can be considered atomic since there is a comparision right before the call to __std_atomic_wait_direct.

With condition variables, the predicate itself is secured by a mutex but how is the atomic itself secured here?

The `memory_order_acquire` and `memory_order_acq_rel` are the possible problem here. Without them, your described problem does not happen. With them, I can't tell you. And if you cannot prove your code correct without those flags, **don't use those flags**. — Yakk - Adam Nevraumont, Jan 04 '21 at 01:34
I already suspected this to be an issue, but why exactly is `memory_order_seq_cst` safe in this case? — Overblade, Jan 04 '21 at 01:37
If there are others like me who wonder in what alternate universe C++ `atomic` has condition variable like apis the answer is this universe in C++20 [Difference between std::atomic and std::condition_variable wait, notify_* methods](https://stackoverflow.com/questions/62859596/difference-between-stdatomic-and-stdcondition-variable-wait-notify-method) — bolov, Jan 04 '21 at 01:37
Seems to me you're trying to use an `std::atomic` as a counting semaphore. I'd just use an `std::counting_semaphore>` instead. I'd also build that into the queue itself, so `push` and `pop` deal with it, and the outside world doesn't. As an aside, I'd also pass a timeout to the `pop`, so it'll automatically wait up to som specified amount of time before failing. — Jerry Coffin, Jan 04 '21 at 01:39
@Overblade I can reason about `memory_order_seq_cst`, so I can show that your scenario cannot happen; I mean, read how `wait` works. I cannot reliably reason about `acquire` or `acq_rel`, so I cannot say anything of note (I do not fully understand how they work). I claim that if you cannot see that `memory_order_seq_cst` is safe, then maybe you shouldn't be using the more exotic memory model flags. — Yakk - Adam Nevraumont, Jan 04 '21 at 01:39
I am aware of how `memory_order_seq_cst` works, but still looking at the implementation of `wait` I don't think it can guarantee it can't happen. Actually, this is not a matter of atomicity itself; as I said the first load in wait (the predicate) could receive 0, then `fetch_add` sets it to 1 so that still the notify could be issued before the wait to the OS. I might be horribly off, but I want to justify at least *why* I think memory ordering isn't the main issue. — Overblade, Jan 04 '21 at 02:12
@Overblade Without sequential consistency, do you know if you'll see the +1 being set before you see the notification arriving? I don't know. With it, I do. — Yakk - Adam Nevraumont, Jan 04 '21 at 03:08
Exactly. But that misses the point. Like I said, it's not necessarily about ordering here. Consider the case I proposed. Take the strongest ordering, no optimizations at all. Can you *guarantee* that the notify happens after the wait call into the OS considering all possible timings between operations in the two threads? — Overblade, Jan 04 '21 at 03:17
`wait` is an atomic operation, and so is `fetch_add`. They don't interleave, but execute in some global order. If `wait` happens to execute first, then it would be followed by `fetch_add` and `notify_all`, which would wake it up. If `fetch_add` happens to execute first, then by the time `wait` is executed, `itemCount.load() != 0` and `wait` will return immediately. You think of `wait` as executing in steps - check the value of the atomic, then enter the wait - but that's as wrong as thinking that `fetch_add` fetches first then adds, and worrying that another operation would squeeze in between. — Igor Tandetnik, Jan 04 '21 at 04:07
If that's the case, then indeed there is no way the described situation could ever happen. Seems like I got confused by the standard library's code because the whole function itself didn't look like one atomic operation. Oh well. — Overblade, Jan 04 '21 at 04:18
@IgorTandetnik That is definitely true with sequential consistency. But without sequential consistency, what guarantee do you have that the thread doing the wait doesn't "see" the notify *before* the increment? The OP explicitly asked things are allowed to happen out of sequence. Do you know that the aquire/release and notify rules still provide sufficient guarantees? I don't. — Yakk - Adam Nevraumont, Jan 04 '21 at 06:11
@IgorTandetnik: I'm skeptical that it's that simple. I've been doing some tests that suggest that (on a common implementation), if you have a `x.fetch_add(1, memory_order_seq_cst)` followed by a `y.load(memory_order_relaxed)`, the load can be reordered in between the load and store of the `fetch_add()`. The `x.fetch_add()` is atomic in the sense that no stores to `x` can come between its load and store, and that other observers will not see intermediate values for `x`, but AFAIK the atomicity isn't related to its ordering. — Nate Eldredge, Jan 04 '21 at 07:45
In that case, it's reasonable to wonder whether `itemCount.notify_one()` could also be reordered in between the load and store of `itemCount.fetch_add()`, which would appear to cause this algorithm to deadlock. — Nate Eldredge, Jan 04 '21 at 07:48
@NateEldredge I added an answer with the detailed analysis. See if you find it convincing. — Igor Tandetnik, Jan 04 '21 at 13:35
@Yakk-AdamNevraumont I added an answer with the detailed analysis. See if you find it convincing. — Igor Tandetnik, Jan 04 '21 at 13:35
If you just spin until `queue.pop` returns `true` you don't need that `itemCount` at all. — Maxim Egorushkin, Jan 04 '21 at 16:53
@MaximEgorushkin I really want to avoid spinning if possible. Some implementations of `std::atomic` default to a short spin, then revert to something less-busy (e.g. libstdc++ from gcc). Of course there is no guarantee but I prefer a solution that at least *allows* non-spinning too. — Overblade, Jan 04 '21 at 18:14
If you'd like to avoid spinning you should use a blocking queue, possibly with a spin-mutex, rather than trying to invent a blocking `pop` for a non-blocking queue. — Maxim Egorushkin, Jan 04 '21 at 20:45

score 8 · Answer 1 · answered Jan 04 '21 at 13:34

Here's what the standard has to say:

[intro.races]/4 All modifications to a particular atomic object M occur in some particular total order, called the modification order of M.

[atomics.wait]/4 A call to an atomic waiting operation on an atomic object M is eligible to be unblocked by a call to an atomic notifying operation on M if there exist side effects X and Y on M such that:
(4.1) — the atomic waiting operation has blocked after observing the result of X,
(4.2) — X precedes Y in the modification order of M, and
(4.3) — Y happens before the call to the atomic notifying operation.

You posit the following scenario:

The current value of itemCount is zero, either from original initialization or a previous fetch_sub.
wait loads itemCount and observes the value of 0.
The other thread calls fetch_add and notify_one
wait goes blocking, as it believes the now-stale value of 0.

In this scenario, M is itemCount, X is the old fetch_sub that brought the value to 0 (we assume that it happened long ago and is properly visible to all threads) and Y is the fetch_add that changes the value to 1.

The standard says that the wait call (spanning steps 2 and 4) is in fact eligible to be unblocked by notify_one call on step 3. Indeed:

(4.1) - wait has blocked after observing itemCount == 0 (if it has not, then the problem fails to arise).
(4.2) - fetch_sub precedes fetch_add in the modification order of itemCount (by assumption that fetch_sub happened long ago).
(4.3) - fetch_add happens before (in fact, is sequenced before) notify_one; they are called one after the other by the same thread.

Therefore, a conforming implementation must either not allow wait to block in the first place, or have notify_one wake it up; it cannot allow the notification to be missed.

The only place where memory orders figure in this discussion is in (4.1), "the atomic waiting operation has blocked after observing the result of X". Perhaps fetch_add actually occurred before wait (by wall clock), but was not visible to wait, so it went blocking anyway. But it doesn't matter to the outcome - either wait observes the result of fetch_add and doesn't block at all; or it observes the result of the old fetch_sub and blocks, but then notfiy_one is required to wake it up.

So, I guess the point is that *eligible to be unblocked by* permits fairly weak ordering. Loading the result of X, where X precedes Y in modification order, does not prove that your load happened-before Y, but happens-before isn't what *eligible to be unlocked by* requires. — Nate Eldredge, Dec 04 '21 at 20:58

score 3 · Answer 2 · answered Dec 04 '21 at 22:39

Just to point out what may have led to the confusion in the first place: the function __std_atomic_wait_direct, all by itself, does the compare and block in an atomic fashion. It boils down to either a call to WaitOnAddress, or else uses a condition variable, where the compare is done while holding the associated mutex.

The if (_Expected_bytes != _Observed_bytes) is merely an optimization. If the variable is already different from the expected value, we can return right away without making the relatively expensive call to an underlying atomic function. It doesn't need to be atomic with the block, and could be removed altogether without affecting the semantics.

C++20 thread possibly waiting on std::atomic forever

2 Answers2

Linked