Adding blocking functions to lock-free queue

Question

I have a lock-free multi producer, single consumer queue, based on a circular buffer. So far, it only has non-blocking push_back() and pop_front() calls. Now I want to add blocking versions of those calls, but I want to minimize the impact this has on the performance of code that uses the non-blocking versions - namely, it should not turn them into "lock-by-default" calls.

E.g. the simplest version of a blocking push_back() would look like this:

void push_back_Blocking(const T& pkg) {
    if (!push_back(pkg)) {
        unique_lock<mutex> ul(mux);
        while (!push_back(pkg)) {
            cv_notFull.wait(ul);
        }
    }
}

but unfortunately this would also require to put the following block at the end of the "non-blocking" pop_front():

{
    std::lock_guard<mutex> lg(mux);
    cv_notFull.notify_all();
}

While the notify alone has hardly any performance impact (if no thread is waiting), the lock has.

So my question is:
How can I (using standard c++14 if possible) add blocking push_back and pop_front member functions to my queue without severely impeding the performance of the non_blocking counterparts (read: minimize system calls)? At least as long as no thread is actually blocked - but ideally even then.

For reference, my current version looks similar to this (I left out debug checks, data alignment and explicit memory orderings):

template<class T, size_t N>
class MPSC_queue {
    using INDEX_TYPE = unsigned long;
    struct Idx {
        INDEX_TYPE idx;
        INDEX_TYPE version_cnt;
    };
    enum class SlotState {
        EMPTY,
        FILLED
    };
    struct Slot {
        Slot() = default;               
        std::atomic<SlotState> state= SlotState::EMPTY;
        T data{};
    };
    struct Buffer_t {
        std::array<Slot, N> data{}; 
        Buffer_t() {
            data.fill(Slot{ SlotState::EMPTY, T{} });
        }
        Slot& operator[](Idx idx) {
            return this->operator[](idx.idx);
        }
        Slot& operator[](INDEX_TYPE idx) {
            return data[idx];                   
        }
    };

    Buffer_t buffer;
    std::atomic<Idx> head{};
    std::atomic<INDEX_TYPE> tail=0;

    INDEX_TYPE next(INDEX_TYPE old) { return (old + 1) % N; }

    Idx next(Idx old) {
        old.idx = next(old.idx);
        old.version_cnt++;
        return old;
    }
public:     
    bool push_back(const T& val) {
        auto tHead = head.load();
        Idx wrtIdx;
        do {
            wrtIdx = next(tHead);
            if (wrtIdx.idx == tail) {
                return false;
            }
        } while (!head.compare_exchange_strong(tHead, wrtIdx));

        buffer[wrtIdx].data = val;
        buffer[wrtIdx].state = SlotState::FILLED;
        return true;
    }

    bool pop_front(T& val) {                
        auto rIdx = next(tail);
        if (buffer[rIdx].state != SlotState::FILLED) {
            return false;
        }
        val = buffer[rIdx].data;
        buffer[rIdx].state = SlotState::EMPTY;
        tail = rIdx;
        return true;
    }
};

Related questions:

I asked a similar question specificly about optimizing the usage of condition_variable::notify here, but the question got closed as a supposedly duplicate of this question.
I disagree, because that question was about why the mutex is needed for condition variables in general (or rather it's pthread equivalent) - focusing on condition_variable::wait - and not if/how it can be avoided for the notify part. But apparently I didn't make that sufficiently clear (or people just disagreed with my opinion).

In any case, the answers in the linked question did not help me and as this was somewhat of an XY-problem anyway, I decided to ask another question about the actual problem I have and thus allow a wider range of possible solutions (maybe there is a way to avoid condition variables altogether).

This question is also very similar, but

It is about C on linux and the answers use platform specific constructs (pthreads and futexes)
The author there asked for efficent blocking calls, but no non-blocking ones at all. I on the other hand don't care too much about the efficiency of the blocking ones but want to keep the non-blocking ones as fast as possible.

`This question` link is invalid. Could you fix it? Also, atomic operations on `head` member are not actually *lock-free*, because of its size (2 * unsigned long). — Tsyvarev, Sep 21 '15 at 10:39
@Tsyvarev:Thank you, I fixed the links. `std::atomic` IS lock-free on VS2015 (x64) (just check `std::atomic{}.is_lock_free()`)and as far as I know, gcc and clang can do this even for 128 Bit sized data types (two `size_t` variables) — MikeMB, Sep 21 '15 at 10:54
Oh, just found that modern x86_64 supports double CAS. Nevermind. — Tsyvarev, Sep 21 '15 at 10:59

score 3 · Accepted Answer · answered Sep 21 '15 at 13:20

If there is potential waiter on condition variable, you have to lock mutex for notify_all call.

The thing is that condition check (!push_back(pkg)) is performed before wait on condition variable (C++11 provides no other way). So mutex is the only mean which can garantee constistency between these actions.

But it is possible to omit locking (and notification) in case when no potential waiter is involved. Just use additinal flag:

class MPSC_queue {
    ... // Original definitions
    std::atomic<bool> has_waiters;

public:
    void push_back_Blocking(const T& pkg) {
        if (!push_back(pkg)) {
            unique_lock<mutex> ul(mux);
            has_waiters.store(true, std::memory_order_relaxed); // #1
            while (!push_back(pkg)) { // #2 inside push_back() method
                cv_notFull.wait(ul);
                // Other waiter may clean flag while we wait. Set it again. Same as #1.
                has_waiters.store(true, std::memory_order_relaxed);
            }
            has_waiters.store(false, std::memory_order_relaxed);
        }
    }

    // Method is same as original, exposed only for #2 mark.
    bool push_back(const T& val) {
        auto tHead = head.load();
        Idx wrtIdx;
        do {
            wrtIdx = next(tHead);
            if (wrtIdx.idx == tail) { // #2
                return false;
            }
        } while (!head.compare_exchange_strong(tHead, wrtIdx));

        buffer[wrtIdx].data = val;
        buffer[wrtIdx].state = SlotState::FILLED;
        return true;
    }

    bool pop_front(T& val) {
        // Main work, same as original pop_front, exposed only for #3 mark.
        auto rIdx = next(tail);
        if (buffer[rIdx].state != SlotState::FILLED) {
            return false;
        }
        val = buffer[rIdx].data;
        buffer[rIdx].state = SlotState::EMPTY;
        tail = rIdx; // #3

        // Notification part
        if(has_waiters.load(std::memory_order_relaxed)) // #4
        {
            // There are potential waiters. Need to lock.
            std::lock_guard<mutex> lg(mux);
            cv_notFull.notify_all();
        }

        return true;
    }
};

Key relations here are:

Setting flag at #1 and reading tail for check condition at #2.
Storing tail at #3 and checking flag at #4.

Both these relations should expose some sort of universal order. That is #1 should be observered before #2 even by other thread. Same for #3 and #4.

In that case one can garantee that, if checking flag #4 found it not set, then possible futher condition check #2 will found effect of condition change #3. So it is safe to not lock (and notify), because no waiter is possible.

In your current implementation universal order between #1 and #2 is provided by loading tail with implicit memory_order_seq_cst. Same order between #3 and #4 is provided by storing tail with implicit memory_order_seq_cst.

In that approach, "Do not lock if no waiters", universal order is the most tricky part. In both relations, it is Read After Write order, which cannot be achieved with any combination of memory_order_acquire and memory_order_release. So memory_order_seq_cst should be used.

I didn't realize, you could minimize the number of memory barriers that much - Thank you very much! However, I think, `has_waiters` must be a counter instead of an bool, because currently - if I'm not missing something - you are setting the flag unconditionally to false, even if there might be other threads potentially waiting. — MikeMB, Sep 21 '15 at 13:40
Probably, using counter instead of flag would be more natural. But here `notify_all` will awake *all* waiters. And after being awaken, waiter rearms flag and checks condition again (see description for line after `.wait()` call). — Tsyvarev, Sep 21 '15 at 13:54
Right, I didn't see that - although I'd probably prefer using a counter and `notify_one` instead - I have to do some measurements. Btw.: I just realized, there is a severe bug in my queue: `Slot::state` isn't atomic! :( — MikeMB, Sep 21 '15 at 14:07
In the `pop_front()` method, if `has_waiters` is true, it acquires the `mux` mutex then notifies `cv_notFull.notify_all();` Is acquiring the mutex really necessary here? Per [cppreference](http://en.cppreference.com/w/cpp/thread/condition_variable/notify_all) it's not necessary to hold the lock in order to notify. That will (may) cause the interior of wait() to block again waiting for the mutex to become available. But I may have missed some subtle reason in this case where not locking would make it unsafe. Would it be any different for SPSC? — Joshua Blake, Jun 06 '16 at 07:48
Flag `has_waiters` is set **before actual waiter is added** for `cv_notFull` conditional variable (see `push_back_Blocking()` method). Without protected `.notify_all()` call it is possible that `.wait()` will *miss notification*. As for SPSC, you need to revisit set of methods, required for it. Possibly, single producer doesn't need both blocked and non-blocked methods. — Tsyvarev, Jun 06 '16 at 08:35
@Tsyvarev Ah, I see it now. If `pop_front` executes after the `while (!push_back(pkg))` but before the `cv_notFull.wait(ul);` it misses notification. Thanks. +1 — Joshua Blake, Jun 06 '16 at 09:22

Adding blocking functions to lock-free queue

1 Answers1

Linked