How to run short asynchronous tasks efficiently under Linux?

Question

I'm trying to meet a soft real-time requirement where a task needs to complete in < 1 ms under Linux. Currently I'm using pthreads with 4 - 8 threads to try and achieve this, but it seems that the overheads and latencies with pthreads under Linux are not well suited to short duration tasks (~100 µs latency each for both pthread_create and pthread_join + some weird non-deterministic behaviour while the threads are running which can add 100 - 200 µs more).

So I'm wondering whether there is any other way to run short async tasks reliably, with reasonably low latency. The tasks are typically < 500 µs and I need them all to complete within 1 ms. Might I be able to use kernel threads (kthreads) directly somehow (e.g. using shared memory for task data) ? Or perhaps something based on interrupts ?

Background information

I have tried playing around with the scheduling options with pthreads on Linux. SCHED_FIFO and SCHED_RR tend to make things worse, regardless of thread priority. Setting thread affinity (pthread_set_affinity_np) helps a little though, as it reduces thread migration between cores.

The current code has also been tested on Mac OS X (which is based on BSD and the Mach kernel) - it works fine with pthreads, and it easily meets the < 1 ms requirement.

It seems that pthreads on Linux is not so well optimised for short duration threads. According to this paper: "The Linux Scheduler: a Decade of Wasted Cores" - there are a number of problems with pthreads on Linux, which seem to have arisen due to the haphazard way in which multi-core support was introduced into the Linux scheduler. My problem does not seem to be related to any of the four problems identified in the paper, but it does suggest that threads on Linux may be something of a curate's egg.

Do you actually need to spawn new threads each time? Why not have a thread-pool? — EOF, Apr 29 '16 at 12:14
@EOF: yes, I've been wondering about that - essentially I would be trading a couple of mutexes for the current `pthread_create`/`pthread_join` calls, plus any latencies involved in waking up sleeping threads etc. It might be worth a try though - I'll try writing some test code for this later to see if it's more efficient. — Paul R, Apr 29 '16 at 12:17
Well, you'd need more than a mutex. You'd need a semaphore or a condition-variable. Still, basically a `futex` under the hood. — EOF, Apr 29 '16 at 12:20
@EOF: thanks, I've never really played with thread pools so I'm a bit fuzzy as to how to do the signalling, but i'll look at futices etc as you suggest. — Paul R, Apr 29 '16 at 12:28
You don't need to use `futex`es yourself. I'd recommend using a semaphore. Specifically a POSIX semaphore, not a SysV semaphore (yuck), and since you're using threads, you can use an unnamed, non `pshared` semaphore from `sem_init()`. — EOF, Apr 29 '16 at 12:32
@EOF: thanks, yes, I was just reading [this question](http://stackoverflow.com/questions/23283225/are-there-any-equivalents-to-the-futex-in-linux-unix) where it seems to suggest that Linux semaphores are built on futices these days anyway, and that they are lower overhead than mutices and condition variables. Sounds promising... — Paul R, Apr 29 '16 at 12:45

score 1 · Answer 1 · 2016-05-01T09:38:36.793

task needs to complete in < 1 ms under Linux

Look, it seems to be not a very strict requirement. There are lots of low-latency financial software that face with even tougher requirements.

There are lots of advice on the Internet on to write low-latency software, for example:

Since you mentioned creating and deleting threads in your question I think you break one of these recomendation (Keep context switches to a minimum). I think you should keep one or a few threads busy waiting, not creating and joining.

Yes, I think you may be right - this came up in the discussion with @EOF in the comments above earlier. I'm just in the middle of writing some test code to try out a thread pool approach with semaphores - I think the crucial issue will be how long it takes to wake up a sleeping thread. — Paul R, Apr 29 '16 at 14:25

score 1 · Answer 2 · answered Apr 29 '16 at 15:16

1

Sounds like all you need is a thread pool. Instead of creating/destroying a thread for each task, you can have long-running worker threads executing tasks from a queue, e.g. multiple-producer-multiple-consumer pattern.

I would start with Intel TBB Task Scheduler.

answered Apr 29 '16 at 15:16

Maxim Egorushkin

131,725
17
180
271

Thanks, yes, the consensus seems to be that thread pools might be the way to go. I did look at TBB but it wasn't clear whether it was layered on top of Linux threads or not, and I'm not sure if I want to be dependent on a proprietary library which is x86-only. It might be something to look at if things get desperate though, before considering switching to another OS. – Paul R Apr 29 '16 at 15:23
@PaulR Most applications use POSIX threads on Linux. There is a lower-level API `clone` but it is rarely used directly. – Maxim Egorushkin Apr 29 '16 at 15:29
Yes, I looked at `clone` as I believe that's also what kernel threads (kthreads) use. Still working on some test code with thread pools and semaphores, but it seems to be buggy... – Paul R Apr 29 '16 at 16:57
@PaulR You could take [this multiple-producer multiple-consumer atomic queue](http://stackoverflow.com/a/9711916/412080) for a spin. Or [a simpler one](http://stackoverflow.com/a/5019461/412080). – Maxim Egorushkin Apr 29 '16 at 17:12
Yes, thanks - I'm not sure what the overhead is on `boost::mutex` though (they both seem to use it) . The Posix semaphores in Linux are based on futices, as I understand it, and these don't trigger any context switches until a thread needs to be woken up. – Paul R Apr 29 '16 at 18:59
@PaulR Mutexes are based on futexes, read up. Only recently were semaphores redesigned to use futexes. Anyway, the cost of mutex lock is atomic exchange, in the non-contended case. – Maxim Egorushkin Apr 29 '16 at 19:13

How to run short asynchronous tasks efficiently under Linux?

Background information

2 Answers2