
I'm aware of blocking queues and of lock-free queues, a great example of both implementations being provided by Michael and Scott, but are there any implementations of a lock-free blocking queue?

In a lock-free blocking queue the dequeue requires no locking, but when there are no items in the queue it blocks the consumer. Are there any implementations of such a beast? I'd prefer a C# implementation, but any language would technically work.
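
To illustrate the contract I'm after, here is a minimal Java sketch (the class name `SemaphoreBlockingQueue` is made up): `ConcurrentLinkedQueue` (which is based on the Michael–Scott algorithm) gives a lock-free enqueue, and a counting `Semaphore` makes dequeue block while the queue is empty. Note that the semaphore itself is not lock-free internally, so this only approximates the goal:

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Semaphore;

// Sketch: one semaphore permit per queued item.
class SemaphoreBlockingQueue<T> {
    private final ConcurrentLinkedQueue<T> queue = new ConcurrentLinkedQueue<>();
    private final Semaphore items = new Semaphore(0);

    void enqueue(T value) {
        queue.add(value);   // lock-free (Michael-Scott based)
        items.release();    // publish one permit per item
    }

    T dequeue() throws InterruptedException {
        items.acquire();    // blocks while the queue is empty
        return queue.poll(); // a permit guarantees an item is present
    }
}
```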

Update:

I think I end up with a race condition on line D14.1:

initialize(Q: pointer to queue_t)
    node = new node()               // Allocate a free node
    node->next.ptr = NULL           // Make it the only node in the linked list
    Q->Head = Q->Tail = node        // Both Head and Tail point to it
    signal = new ManualResetEvent() // Create a manual reset event

enqueue(Q: pointer to queue_t, value: data_type)
E1:     node = new node()       // Allocate a new node from the free list
E2:     node->value = value     // Copy enqueued value into node
E3:     node->next.ptr = NULL   // Set next pointer of node to NULL
E4:     loop                    // Keep trying until Enqueue is done
E5:         tail = Q->Tail      // Read Tail.ptr and Tail.count together
E6:         next = tail.ptr->next   // Read next ptr and count fields together
E7:         if tail == Q->Tail      // Are tail and next consistent?
E8:             if next.ptr == NULL // Was Tail pointing to the last node?
E9:                 if CAS(&tail.ptr->next, next, <node, next.count+1>) // Try to link node at the end of the linked list
E10.1:                  signal.Set()    // Signal the blocking dequeues
E10.2:                  break           // Enqueue is done. Exit loop
E11:                endif
E12:            else    // Tail was not pointing to the last node
E13:                CAS(&Q->Tail, tail, <next.ptr, tail.count+1>) // Try to swing Tail to the next node
E14:            endif
E15:        endif
E16:     endloop
E17:    CAS(&Q->Tail, tail, <node, tail.count+1>) // Enqueue is done. Try to swing Tail to the inserted node


dequeue(Q: pointer to queue_t, pvalue: pointer to data_type): boolean
D1:     loop                    // Keep trying until Dequeue is done
D2:         head = Q->Head      // Read Head
D3:         tail = Q->Tail      // Read Tail
D4:         next = head.ptr->next   // Read Head.ptr->next
D5:         if head == Q->Head      // Are head, tail, and next consistent?
D6:             if head.ptr == tail.ptr // Is queue empty or Tail falling behind?
D7:                 if next.ptr == NULL // Is queue empty?
D8.1:                   signal.WaitOne()    // Block until an enqueue
D8.X:                   // removed: return FALSE // Queue is empty, couldn't dequeue
D9:                 endif
D10:                CAS(&Q->Tail, tail, <next.ptr, tail.count+1>) // Tail is falling behind. Try to advance it
D11:            else    // No need to deal with Tail
                    // Read value before CAS; otherwise another dequeue might free the next node
D12:                *pvalue = next.ptr->value
D13:                if CAS(&Q->Head, head, <next.ptr, head.count+1>) // Try to swing Head to the next node
D14.1:                  if head.ptr == tail.ptr && next.ptr == NULL // Is queue empty? <--- POSSIBLE RACE CONDITION???
D14.2:                      signal.Reset()
D14.3:                  break   // Dequeue is done. Exit loop
D15:                endif
D16:            endif
D17:        endif
D18:    endloop
D19:    free(head.ptr)  // It is safe now to free the old dummy node
D20:    return TRUE     // Queue was not empty, dequeue succeeded
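
For reference, the lock-free core above (without the signaling) can be expressed in Java with `AtomicReference`. Under garbage collection the ABA problem that the count fields guard against goes away, so plain references suffice. This is a sketch of the non-blocking variant only, not a solution to the race:

```java
import java.util.concurrent.atomic.AtomicReference;

// Michael-Scott lock-free queue core; dequeue returns null when empty.
class MSQueue<T> {
    private static final class Node<T> {
        final T value;
        final AtomicReference<Node<T>> next = new AtomicReference<>();
        Node(T value) { this.value = value; }
    }
    private final AtomicReference<Node<T>> head, tail;

    MSQueue() {
        Node<T> dummy = new Node<>(null);       // initial dummy node
        head = new AtomicReference<>(dummy);
        tail = new AtomicReference<>(dummy);
    }

    void enqueue(T value) {
        Node<T> node = new Node<>(value);
        while (true) {
            Node<T> t = tail.get();
            Node<T> next = t.next.get();
            if (t == tail.get()) {              // tail and next consistent?
                if (next == null) {
                    if (t.next.compareAndSet(null, node)) {
                        tail.compareAndSet(t, node); // swing tail (best effort)
                        return;
                    }
                } else {
                    tail.compareAndSet(t, next); // help a lagging tail
                }
            }
        }
    }

    T dequeue() {
        while (true) {
            Node<T> h = head.get();
            Node<T> t = tail.get();
            Node<T> next = h.next.get();
            if (h == head.get()) {
                if (h == t) {
                    if (next == null) return null;  // empty
                    tail.compareAndSet(t, next);    // tail falling behind
                } else {
                    T value = next.value;           // read before the CAS
                    if (head.compareAndSet(h, next)) return value;
                }
            }
        }
    }
}
```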
Kiril
  • Your pdf link is bad, think you can fix it? – Will Hartung Apr 20 '11 at 18:00
  • The link is http://www.research.ibm.com/people/m/michael/podc-1996.pdf. He missed the 'f' at the end. – Na7coldwater Apr 20 '11 at 18:03
  • @Will, sorry, my link was missing an f from the pdf, should be fine now (thanks for pointing it out). – Kiril Apr 20 '11 at 18:03
  • note: .NET 4 brings a *lot* of concurrency and lock free implementation. –  Apr 20 '11 at 18:05
  • Thanks all for the fixed links. – Will Hartung Apr 20 '11 at 18:15
  • @David locks result in numerous function calls and essentially make your application run sequentially with respect to the section/method they're locking. Removing locks makes your application truly multithreaded and reduces the amount of code that has to be executed, but it requires **skilful** programming. – Kiril Apr 20 '11 at 23:45
  • Lock free can be slower. Who says your locks are a problem at the moment? Did you check this? Plenty of truly multithreaded apps use locks. You are showing signs of a buzzword obsession. – David Heffernan Apr 21 '11 at 06:26
  • @David Lock-free can be slower in **some situations**, but the paper I referenced shows a very good test of the various queues and the lock-free queues did indeed outperform the lock-based queues. – Kiril Apr 21 '11 at 12:38
  • why don't you stop using managed code if you want performance? – David Heffernan Apr 21 '11 at 12:44
  • @David I don't work in a silo, so while I don't mind switching to non-managed code, I'm not sure if that will jive well with the rest of the company or my boss. The cost of switching to non-managed code would be too great to justify the result. Furthermore, even if the ENTIRE COMPANY decides to go with non-managed code, I'd still like to use the fastest solution possible: lock-free-blocking-queue. – Kiril Apr 21 '11 at 13:02
  • my guess is that the performance gain would be negligible. Presumably time spent working with the queue is negligible in comparison to everything else you do? – David Heffernan Apr 21 '11 at 13:13
  • @David Presumably it would be, but I would personally find it gratifying if I can solve this "problem." – Kiril Apr 21 '11 at 13:27
  • How apt. Raymond Chen has been blogging about lock-free algorithms for the past week. Today he says "Remember that switching to a lock-free algorithm should be guided by performance measurements. Switching from a simple algorithm to a complex one shouldn't be done unless you know that the simple algorithm is having trouble." – David Heffernan Apr 21 '11 at 15:37
  • @David: If I am able to create a lock-free-blocking-queue, then I'd do the performance measures... hehe. – Kiril Apr 21 '11 at 15:53
  • If it was me I'd be scared of getting it wrong and having very intermittent bugs that only manifested on my clients' machines! – David Heffernan Apr 21 '11 at 15:58
  • @David I **am** scared to get it wrong and that's why I'm here ;) – Kiril Apr 21 '11 at 16:02
  • First, I've done performance measurements. In our application a lock-free queue was at least 5x faster. Second, the code for a queue using mutexes/cond vars gets about as complicated. Sometimes the code gets shorter with lock-free. For instance, I can increment a value with a one-line atomic op rather than using 3 lines to lock/unlock a mutex around an increment. – johnnycrash Apr 21 '11 at 16:35
  • Back to the question. 1) Why do you need a tail pointer? Whenever you do lock-free stuff, think "less is more". We have many lock-free singly linked lists with only a head pointer that has to be CAS'd. That won't solve your race condition, though it will make the code simpler. 2) What you need is a 'helper' – johnnycrash Apr 21 '11 at 16:36
  • Helper: the idea is that if a CAS fails, then shift responsibility rather than block...or if a race might happen, add a couple extra ways to undo the race rather than block. Maybe you have more than 1 thread always? Make it so a thread that comes in and notices extra work checks for a thread in a coma. Maybe you have a lot of work? Make it so adding work looks for a thread in a coma. – johnnycrash Apr 21 '11 at 16:42
  • @Lirik, instead of implementing it all yourself (as in pseudo-code), why not wrap an existing .NET non-concurrent collection. Other than that, your approach seems similar to what both answers suggested. – Danny Varod Apr 21 '11 at 17:26
  • @Danny I already have a [blocking queue](http://stackoverflow.com/questions/5739409/checking-a-queuet-continuously/5739429#5739429) that works spectacularly well and it does indeed wrap an existing .NET collection, but as johnnycrash mentioned: the lock-free queue is about 5x faster. So far johnnycrash's solution is probably the closest. – Kiril Apr 21 '11 at 17:32
  • You can wrap an existing .NET collection in a lock-free (busy wait) manner, like I suggested. It is similar to what you suggested (loop until success), only the CAS protection is on the whole enqueue/dequeue operation and not on one part. – Danny Varod Apr 21 '11 at 17:37
  • blocking queue with locks is trivial to implement. 5x faster is meaningless. You don't have a queue for its own sake. 5x nothing is still nothing. – David Heffernan Apr 21 '11 at 18:19

2 Answers


.NET Parallel Extensions (built in for .NET 4.0+):

http://blogs.msdn.com/b/pfxteam/archive/2010/01/26/9953725.aspx


An implementation by a StackOverflow user:

Lock free constructs in .net



Response to clarification in comments:

If the blocking when empty is not busy (i.e. it waits for a signal), then it seems like you need a counting semaphore to wait on.

An alternative approach could be to use a regular queue, together with atomic compare-and-exchange or a spin lock to prevent simultaneous access.
Then, if a consumer thread tries to enter when the queue is empty, lock a binary semaphore;
if a producer thread tries to enter when the queue is empty, unlock the binary semaphore to awaken all sleeping consumers (and return them to the spin lock, so that multiple threads can only enter if there are enough items in the queue for them).

E.g. // pseudo code

/// Non-blocking lock (busy wait)
void SpinLock()
{
    while (CompareAndExchange(myIntegerLock, -1, 0) != 0)
    {
        // wait
    }
}

void UnSpinLock()
{
    Exchange(myIntegerLock, 0);
}

void AddItem(item)
{
    // Use CAS for synchronization
    SpinLock(); // Non-blocking lock (busy wait)

    queue.Push(item);

    // Unblock any blocked consumers
    if (queue.Count() == 1)
    {
        semaphore.Increase();
    }

    // End of CAS synchronization block
    UnSpinLock();
}

Item RemoveItem()
{
    // Use CAS for synchronization
    SpinLock(); // Non-blocking lock (busy wait)

    // If empty then block
    if (queue.Count() == 0)
    {
        // End of CAS synchronization block
        UnSpinLock();

        // Block until queue is not empty
        semaphore.Decrease();

        // Try again (may fail again if there is more than one consumer)
        return RemoveItem();
    }

    result = queue.Pop();

    // End of CAS synchronization block
    UnSpinLock();

    return result;
}
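
The `SpinLock()`/`UnSpinLock()` pair above can be rendered in Java with an `AtomicInteger` standing in for `myIntegerLock` (the class name `SpinIntLock` is mine; `Thread.onSpinWait()` is a Java 9+ spin hint):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Busy-wait lock: acquire with a CAS loop, release with a plain atomic store.
class SpinIntLock {
    private final AtomicInteger state = new AtomicInteger(0); // 0 = free, -1 = held

    void lock() {
        while (!state.compareAndSet(0, -1)) {
            Thread.onSpinWait(); // hint to the CPU that we are spinning
        }
    }

    void unlock() {
        state.set(0);
    }
}
```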
Danny Varod
  • @Danny thanks for the links, but I'm specifically looking for lock-free-blocking-queue and from what I read it seems like a pretty difficult thing to achieve. I wonder if anybody has actually managed to get one working. – Kiril Apr 20 '11 at 18:05
  • Blocking client threads if queue is busy? - If so a spin-lock (mutex-free lock / busy wait) surrounding a queue should do the trick. – Danny Varod Apr 20 '11 at 18:07
  • @Danny blocking if the queue is empty; there shouldn't be a lock when the queue has items in it. The Scott paper shows the two implementations: a lock-free queue and a blocking queue... both of them are implemented in Java. I haven't seen a lock-free AND blocking queue (i.e. one that only blocks when it's empty). – Kiril Apr 20 '11 at 18:11
  • @lukas the ConcurrentBag is very good, but uses locking: http://stackoverflow.com/questions/4785622/why-is-concurrentbagt-so-slow-in-net-4-0-am-i-doing-it-wrong – Kiril Apr 20 '11 at 18:13
  • @Danny see my edit, I added the pseudo code with what I "envisioned" as a solution. – Kiril Apr 21 '11 at 13:29

EDIT:

SIMPLER: I suggest you don't need a head and a tail for your queue. Just have a head. If head == NULL, the list is empty. Add items at the head; remove items from the head. Simpler, and fewer CAS ops.

HELPER: I suggested in the comments that you need to think of a helper scheme to handle the race. In my version of what "lock free" means, it's OK to have rare race conditions if they don't cause problems. I like the extra performance versus having an idle thread sleep a couple of ms too long.

Helper ideas: when a consumer grabs work, it can check whether there is a thread in a coma. When a producer adds work, it can look for threads in comas.

So track sleepers. Use a linked list of sleepers. When a thread decides there is no work, it marks itself as !awake and CAS's itself onto the head of the sleeper list. When a signal is received to wake up, the thread marks itself as awake.

Then the newly awakened thread cleans up the sleeper list. To clean up a concurrent singly linked list, you have to be careful: you can only CAS at the head. So while the head of the sleeper list is marked awake, you can CAS the head off. If the head is not awake, continue to scan the list and "lazy unlink" (I made that term up) the remaining awake items. Lazy unlink is simple: just set the next ptr of the previous item over the awake item. A concurrent scan will still make it to the end of the list even if it reaches items that are !awake. Subsequent scans see a shorter list.

Finally, any time you add work or pull off work, scan the sleeper list for !awake items. If a consumer notices work remains after grabbing some work (.next work != NULL), the consumer can scan the sleeper list and signal the first thread that is !awake. After a producer adds work, the producer can do the same.
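
The register-then-park part of that scheme might look like this in Java (names are mine; the lazy-unlink cleanup is omitted, so in this sketch the list only grows):

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.LockSupport;

// Sleeper list: a Treiber-stack of parked threads, woken one at a time.
class SleeperList {
    static final class Sleeper {
        final Thread thread = Thread.currentThread();
        volatile boolean awake = false;
        Sleeper next;
    }

    private final AtomicReference<Sleeper> head = new AtomicReference<>();

    // Mark self !awake, CAS onto the head, then park until signaled.
    void sleep() {
        Sleeper s = new Sleeper();
        do {
            s.next = head.get();
        } while (!head.compareAndSet(s.next, s));
        while (!s.awake) {
            LockSupport.park(); // may return spuriously; re-check the flag
        }
    }

    // Scan for the first !awake sleeper and signal it.
    void wakeOne() {
        for (Sleeper s = head.get(); s != null; s = s.next) {
            if (!s.awake) {
                s.awake = true;
                LockSupport.unpark(s.thread);
                return;
            }
        }
    }
}
```

Because an `unpark` issued before `park` leaves a cached permit, the race between registering and actually parking is harmless here, which mirrors the sigwait behavior described below.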

If you have a broadcast scenario and can't signal a single thread, then just keep a count of asleep threads. While that count is > 0, a consumer noticing remaining work and a producer adding work would broadcast the signal to wake up.

In our environment, we have 1 thread per SMT, so the sleeper list can never be that large (well unless I get my hands on one of those new 128 concurrent thread machines!) We generate work items early in a transaction. In the first sec we might generate 10,000 work items, and this production rapidly tapers off. Threads work for a couple sec on those work items. So, we rarely have a thread on the idle pool.

YOU CAN STILL USE LOCKS: If you have only 1 thread and generate work rarely, this won't work for you. In that case the performance of mutexes is of no concern and you should just use them. Use a lock on the sleeper queue in this scenario. Think of lock-free as "no locks where it counts".

PREVIOUS POST: Are you saying: there is a queue of work; there are many consumer threads; a consumer needs to pull off work and do it if there is any; and a consumer thread needs to sleep until there is work?

If you are, we do this using only atomic operations this way:

The queue of work is a linked list. There is also a linked list of sleeping threads.

To add work: CAS the head of the list to the new work. When work is added, we check to see if there are any threads on the sleeper list. If there are, before adding the work, we CAS a sleeper off the sleeper list, set its work = the new work, and then signal the sleeper to wake up. Then we add the work to the work queue.

To consume work: CAS the head of the list to head->next. If the head of the work list is NULL, we CAS the thread to a list of sleepers.

Once a thread has a work item, the thread must CAS the work item's state to WORK_INPROGRESS or some such. If that fails, it means the work is being performed by another, so the consumer thread goes back to search for work. If a thread wakes up and has a work item, it still has to CAS the state.

So if work is added, a sleeping consumer is always woken up and handed the work. pthread_kill() always wakes a thread at sigwait(), because even if the thread gets to sigwait() after the signal, the signal is received. This solves the problem of a thread putting itself on the sleeper list but getting signaled before going to sleep. All that happens is the thread tries to own its ->work if there is one. Failing to own work, or not having work, sends the thread back to consume-start. If a thread fails to CAS itself onto the sleeper list, it means that either another thread beat it, or the producer pulled a sleeper off. For safety, we have the thread act as if it were just woken up.
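
Java's `LockSupport` has the same cached-signal property as the sigwait() mechanism described above: an `unpark` issued before `park` leaves a permit that the subsequent `park` consumes immediately, so a thread signaled before it actually sleeps does not get stuck. A minimal demonstration (class name mine):

```java
import java.util.concurrent.locks.LockSupport;

class CachedPermit {
    // The "signal" (unpark) happens before the "wait" (park), yet the
    // waiter does not block: the permit is cached, just like a pending signal.
    static boolean signalThenWait() {
        LockSupport.unpark(Thread.currentThread()); // signal delivered first
        LockSupport.park();                         // consumes cached permit; returns at once
        return true;
    }
}
```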

We get no race conditions doing this and have multiple producers and consumers. We also have been able to expand this to allow threads to sleep on individual work items as well.

johnnycrash
  • @johnnycrash no, I mean the consumer pulls items off the queue until there is nothing in the queue. When the consumer calls dequeue and there is nothing in the queue, the consumer gets blocked on the dequeue call (this is the standard blocking queue, which I already have). See the referenced paper for a description of both queues... basically the blocking queue locks on dequeue; a lock-free queue does not lock (as you said, it uses CAS), but a lock-free queue returns if there is nothing in the queue, and I would like the lock-free queue to block instead of return. – Kiril Apr 20 '11 at 21:54
  • @Lirik, When the queue is empty, the consumer calls sigwait. Is that what you mean by blocking? Also, we have not found a use for a head and tail in our queues. We just use a head and the enqueuing and dequeuing are much simpler. – johnnycrash Apr 20 '11 at 22:08
  • @Lirik, Our GetWork() function blocks when the thread is added to the sleeper list and calls sigwait(). sigwait() sleeps the thread. Later when work appears, we call pthread_kill() to wake the thread. I don't know if .Net has a way to sleep a thread and wake it via signal like linux does. But, long story short... We block when the queue is empty, and we don't use locks anywhere. – johnnycrash Apr 20 '11 at 23:08
  • @johnnycrash I'm following you now! OK, I think I know how to get this done: instead of having a list of sleeping threads, I could just have the dequeue block on a ManualResetEvent until there is something in the queue (the ManualResetEvent will be set on enqueue and reset on dequeue when the queue is empty). I'm going to try and jot up some code a bit later tonight... I think I'm getting it now. – Kiril Apr 21 '11 at 00:21
  • @Lirik, cool! The key to it working, I believe, is you always wake a sleeper if you add work to the queue. If there are no sleepers...don't wake anything. Watch out for the race to complete the sleep after placing oneself on the sleeper list. If the producer signals a thread before that thread has completed its sleep you *could* get into the situation where all threads are asleep. We don't see this happening because sigwait caches signals. Now you got me thinking there might be a mutex in sigwait() arrrrrgh! – johnnycrash Apr 21 '11 at 01:12
  • @johnnycrash I think I end up with a race condition (see the pseudo-code above)... I thought I **almost** had it! – Kiril Apr 21 '11 at 13:29
  • Yes the dreaded race between waiting on a signal that might already be being sent to you. – johnnycrash Apr 21 '11 at 17:53