16

I'm implementing a lock-free single-producer/single-consumer queue for an intensive network application. I have a bunch of worker threads, each receiving work in its own separate queue, which they then dequeue and process.

Removing the locks from these queues has greatly improved performance under high load, but the threads no longer block when their queues are empty, which in turn causes CPU usage to skyrocket.

How can I efficiently cause a thread to block until it can successfully dequeue something or is killed/interrupted?

haste
  • 1,441
  • 1
  • 10
  • 21
  • Hi, can you share the RPS (requests per second) you achieved using this approach? I did similar work (implementing a simple HTTP server) so I'm interested to know it. I don't know how to contact you other than by commenting here. Sorry if I bothered you. – Ayub Aug 28 '12 at 06:05
  • @Ayub Performance was alright. RPS isn't a good unit for measuring performance due to different hardware setups, etc. I redesigned the application to allow worker threads to operate in complete isolation, and the performance gain was ~10x. Sharing less data was truly the key. – haste Aug 28 '12 at 16:27
  • Can you explain why you chose an approach of one queue per worker? Sounds pretty suboptimal to me. Execution time of jobs on the queues is hard to foresee. – class stacker Dec 02 '15 at 09:29

6 Answers

16

If you're on Linux, look into using a futex. It provides the performance of a non-locking implementation by using atomic operations rather than kernel calls (as a mutex would), but should you need to put the thread to sleep because some condition doesn't hold (e.g., lock contention, or an empty queue), it will then make the appropriate kernel calls to put the thread to sleep and wake it back up at a future event. It's basically like a very fast semaphore.

Jason
  • 31,834
  • 7
  • 59
  • 78
  • Helpful clarification! I upvoted both Futex-related answers for now. Thanks. – haste May 22 '11 at 19:22
  • +1 for `futex`. Its performance is not as good as lock-free, but it is good enough and is the perfect choice when mutex locking is too much. The pthread mutex API uses `futex` under the hood. –  Jun 15 '11 at 20:26
  • 3
    Mutexes on Linux are implemented with a short `cmpxchg` spin for the low-contention case, falling back to a `futex` call otherwise. I don't really understand why you're calling it non-locking when it implements locking (fast userspace mutex - the origin of the name). – strcat Dec 26 '13 at 21:46
  • I'm calling it non-locking because of the low-contention case which is what "typically" occurs unless you're under high load ... – Jason Jan 01 '14 at 17:25
9

On Linux, futex can be used to block a thread. But be aware that Futexes Are Tricky!

UPDATE: condition variables are much safer to use than futexes, and are more portable. However, a condition variable is used in combination with a mutex, so strictly speaking the result will no longer be lock-free. Still, if your primary goal is performance (rather than a guarantee of global progress), and the locked portion (i.e., the condition to check after thread wakeup) is small, you may well get satisfactory results without going into the subtleties of integrating futexes into the algorithm.

Alexey Kukanov
  • 12,479
  • 2
  • 36
  • 55
  • This looks very interesting. I will explore this shortly and return with my verdict. Thank you. – haste May 22 '11 at 19:19
  • 1
    The `futex` call is *an implementation of locking*. The mutex API just spins on `cmpxchg` a bit and then falls back to `futex` (fast userspace *mutex*). – strcat Dec 26 '13 at 21:49
7

If you're on Windows, you won't be able to use futexes, but Windows Vista has a similar mechanism called Keyed Events. Unfortunately, this isn't part of the published API (it's an NTDLL native API), but you can use it as long as you accept the caveat that it might change in future versions of Windows (and you don't need to run on pre-Vista kernels). Be sure to read the article I linked above. Here's an untested sketch of how it might work:

/* Interlocked SList queue using keyed event signaling */

struct queue {
    SLIST_HEADER slist;
    // Note: Multiple queues can (and should) share a keyed event handle
    HANDLE keyed_event;
    // Initial value: 0
    // Prior to blocking, the queue_pop function increments this to 1, then
    // rechecks the queue. If it finds an item, it attempts to cmpxchg back to
    // 0; if this fails, then it's racing with a push, and has to block
    LONG block_flag;
};

void init_queue(queue *qPtr) {
    NtCreateKeyedEvent(&qPtr->keyed_event, -1, NULL, 0);
    InitializeSListHead(&qPtr->slist);
    qPtr->block_flag = 0;
}

void queue_push(queue *qPtr, SLIST_ENTRY *entry) {
    InterlockedPushEntrySList(&qPtr->slist, entry);

    // Transition block flag 1 -> 0. If this succeeds (block flag was 1), we
    // have committed to a keyed-event handshake
    LONG oldv = InterlockedCompareExchange(&qPtr->block_flag, 0, 1);
    if (oldv) {
        NtReleaseKeyedEvent(qPtr->keyed_event, (PVOID)qPtr, FALSE, NULL);
    }
}

SLIST_ENTRY *queue_pop(queue *qPtr) {
    SLIST_ENTRY *entry = InterlockedPopEntrySList(&qPtr->slist);
    if (entry)
        return entry; // fast path

    // Transition block flag 0 -> 1. We must recheck the queue after this point
    // in case we race with queue_push; however since ReleaseKeyedEvent
    // blocks until it is matched up with a wait, we must perform the wait if
    // queue_push sees us
    LONG oldv = InterlockedCompareExchange(&qPtr->block_flag, 1, 0);

    assert(oldv == 0);

    entry = InterlockedPopEntrySList(&qPtr->slist);
    if (entry) {
        // Try to abort
        oldv = InterlockedCompareExchange(&qPtr->block_flag, 0, 1);
        if (oldv == 1)
            return entry; // nobody saw us, we can just exit with the value
    }

    // Either we don't have an entry, or we are forced to wait because
    // queue_push saw our block flag. So do the wait
    NtWaitForKeyedEvent(qPtr->keyed_event, (PVOID)qPtr, FALSE, NULL);
    // block_flag has been reset by queue_push

    if (!entry)
        entry = InterlockedPopEntrySList(&qPtr->slist);
    assert(entry);

    return entry;
}

You could also use a similar protocol using Slim Read Write locks and Condition Variables, with a lockless fast path. These are wrappers over keyed events, so they may incur more overhead than using keyed events directly.

bdonlan
  • 224,562
  • 31
  • 268
  • 324
  • 3
    Yeah, but the futex answers were already taken and I thought it might be useful for someone else searching on the subject later :) – bdonlan May 23 '11 at 13:12
  • For completeness, this is interesting--I do occasionally develop for Windows. :-) – haste May 23 '11 at 14:29
  • 1
    +1, Also keyed events work fine on pre-Vista, too (they were initially implemented as a single-global-handle-for-all backoff for an out-of-handle deadlock situation with critical sections under 2k). The only difference between 2k/XP and Vista/7/8 is an implementation detail (linked list vs. hash) which makes Vista and later KEs much more efficient if you watch many memory locations with a single handle (no practical difference for 99% of all applications). – Damon Aug 22 '13 at 15:49
1

Have you tried waiting on a condition variable? When the queue becomes empty, just start waiting for a new job; the thread putting jobs in the queue signals the condition variable. This way you only use locks when the queue is empty.

https://computing.llnl.gov/tutorials/pthreads/#ConditionVariables

azyoot
  • 1,162
  • 8
  • 18
  • 1
    Any calls made using pthread functions will make a call into the kernel, which will be significantly slower than atomic operations. The beauty of the futex is that you're using a combination of atomics and a wait-queue at the kernel level, but if there is no lock contention, then you never touch the kernel-level wait queue. On the other-hand, using condition variables automatically places the thread in a wait-queue, so with condition variables, you're going through the kernel no matter what, which is going to be a lot slower if 90% of the time an atomic operation would be fine. – Jason May 22 '11 at 19:28
  • 5
    @Jason: pthread mutexes on Linux do not enter the kernel in the non-contended case; neither does a cvar signal with no waiters. Of course, a cvar wait must enter the kernel, but so must any blocking operation – bdonlan May 22 '11 at 19:42
  • @bdonlan: Thanks for the info ... but if that's the case, then what makes a futex faster than a normal pthread mutex? According to the paper *Fuss, futexes and furwocks: Fast Userlevel Locking in Linux* by Franke, Russell, and Kirkwood, the reasoning behind the futex was to avoid a kernel call if there was no lock contention, i.e., when that paper was written, at least the impression I got from reading it was that mutexes and semaphores always required a kernel call. If they never accessed the kernel then what was the point of the futex? – Jason May 23 '11 at 00:48
  • 5
    @Jason, The futex is the underlying primitive used to create NPTL pthread mutexes. It's a bit complex and difficult to use futex() safely on its own, so the pthread library wraps up some of the more common futex protocols, and calls them mutexes and condition variables. It's also more portable this way - futex() is a Linux-only syscall. – bdonlan May 23 '11 at 01:14
  • 1
    @bdonlan: Appreciate the clarification. Does that mean, though, that futexes and pthread mutexes are the same speed on Linux, and there's no benefit to using futexes since mutexes are futexes? Or is there still a performance penalty for mutexes, and if so, from what? – Jason May 23 '11 at 04:06
  • 1
    A mutex is implemented with a short spin to handle the low-contention case, and then a system call via `futex`. It doesn't make sense to use `futex` directly if you don't know what you're doing. – strcat Dec 26 '13 at 21:48
  • @Jason There's really no significant advantage of using `futex` on Linux if all you need is a `pthread_mutex`. `pthread_mutex` is written in assembly language and even avoids saving CPU register contents on the stack when calling `futex` because it knows that `futex` does not modify them. – class stacker Dec 02 '15 at 09:12
1

You can cause a thread to sleep by using the sigwait() function, and wake it with pthread_kill(). This can be faster than condition variables.

johnnycrash
  • 5,184
  • 5
  • 34
  • 58
  • 1) Please provide a reference for your claim that a signal-based mechanism will be "much faster" than a condition variable. In the case that the thread has to wake up, in both cases the scheduling and cacheing aspects play a huge role. 2) Please provide an outline of how to handle race conditions and the case that the thread does not go to sleep but rather picks up an entry from the non-empty queue. In practice, this fast path is the key factor to scalability/performance under load. And it's getting a bit tricky if you have a mixed concept. – class stacker Dec 02 '15 at 09:08
0

You could add sleeps while it's waiting. Just pick the biggest wait you're willing to have, then do something like this (pseudocode, with the sleep expressed in milliseconds):

WAIT_TIME_MS = 100  // Set this to whatever you're happy with
while (loop_condition) {
    thing = get_from_queue()
    if (thing == null) {
        sleep_ms(WAIT_TIME_MS)
    } else {
        handle(thing)
    }
}

Even something short like a 100 ms sleep should significantly lower the CPU usage. I'm not sure at what point the context switching would make it worse than busy-waiting, though.

Brendan Long
  • 53,280
  • 21
  • 146
  • 188
  • I considered this but the sleep time would have to be ridiculously low as the application is expected to complete requests within a millisecond or two. – haste May 22 '11 at 19:17
  • This doesn't solve the problem as elegantly as other answers. – rmcclellan Jun 24 '15 at 12:51
  • @rmcclellan "Not the best" is usually not a good reason for a downvote, but do what you want. I realize that this solution isn't the best one for the OP (as they comment above), but Stack Overflow answers also come up in search results and this *is* a good solution in some cases (bursty traffic where you need no locks while handling results, but don't want to waste CPU time when nothing is happening). – Brendan Long Jun 24 '15 at 14:33
  • It might be nice to mention in the answer the places where this solution is a good one. – rmcclellan Jun 24 '15 at 16:10