
Is there any good empirical data on the cost of context switching between threads of the same process on Linux (x86 and x86_64, mainly, are of interest)? I'm talking about the number of cycles or nanoseconds between the last instruction one thread executes in userspace before getting put to sleep voluntarily or involuntarily, and the first instruction a different thread of the same process executes after waking up on the same cpu/core.

I wrote a quick test program that constantly performs rdtsc in 2 threads assigned to the same cpu/core, stores the result in a volatile variable, and compares to its sister thread's corresponding volatile variable. The first time it detects a change in the sister thread's value, it prints the difference, then goes back to looping. I'm getting minimum/median counts of about 8900/9600 cycles this way on an Atom D510 cpu. Does this procedure seem reasonable, and do the numbers seem believable?
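
In outline, the test looks something like the sketch below (an illustrative reconstruction rather than the exact program; the CPU number, variable names and output format are arbitrary), built with `gcc -O2 -pthread`:

```c
/* Illustrative reconstruction of the test described above, not the exact
 * program: two threads pinned to one CPU each publish rdtsc readings and
 * report the gap they observe across a context switch. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>                  /* __rdtsc() */

static volatile uint64_t shared[2];     /* latest TSC value published by each thread */

static void pin_to_cpu0(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}

static void *worker(void *arg)
{
    int me = (int)(intptr_t)arg, other = 1 - me;
    uint64_t last_seen = 0;

    pin_to_cpu0();                      /* both threads share one core, forcing switches */
    for (;;) {
        uint64_t now = __rdtsc();
        shared[me] = now;
        uint64_t sister = shared[other];
        /* the sister's value only changes while we are descheduled, so a change
         * means we just came back from a switch; now - sister approximates the
         * time since the sister last ran */
        if (last_seen != 0 && sister != last_seen)
            printf("thread %d: %llu cycles\n", me,
                   (unsigned long long)(now - sister));
        last_seen = sister;
    }
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    pthread_create(&t[0], NULL, worker, (void *)(intptr_t)0);
    pthread_create(&t[1], NULL, worker, (void *)(intptr_t)1);
    pthread_join(t[0], NULL);           /* loops run until the program is killed */
    return 0;
}
```

The minimum over many samples is the figure of interest, since each reading also includes the loop overhead on both sides.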

My goal is to estimate whether, on modern systems, a thread-per-connection server model could be competitive with or even outperform select-type multiplexing. This seems plausible in theory, since the transition from performing IO on fd X to fd Y involves merely going to sleep in one thread and waking up in another, rather than multiple syscalls, but it depends on the overhead of context switching.

R.. GitHub STOP HELPING ICE

1 Answer


(Disclaimer: This isn't a direct answer to the question, it's just some suggestions that I hope will be helpful).

Firstly, the numbers you're getting certainly sound like they're within the ballpark. Note, however, that interrupt/trap latency can vary a lot among different CPU models implementing the same ISA. It also matters whether your threads have used floating-point or vector operations: if they haven't, the kernel avoids saving and restoring the floating-point/vector unit state.

You should be able to get more accurate numbers by using the kernel tracing infrastructure - perf sched in particular is designed to measure and analyse scheduler latency.

If your goal is to model thread-per-connection servers, then you probably shouldn't be measuring involuntary context switch latency: in such a server, the majority of context switches will be voluntary, as a thread blocks in read() waiting for more data from the network. A better testbed might therefore measure the latency from one thread going to sleep in read() to another thread being woken up from its own read().
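
Something along the lines of the sketch below could serve as such a testbed. It is only illustrative: a pipe stands in for the network socket, CPU pinning and error handling are omitted, and the timestamp taken just before write() is passed through the pipe so the reader can report the write-to-wakeup gap in cycles.

```c
/* Sketch of a voluntary-switch measurement: the reader blocks in read() on a
 * pipe, the writer sends a timestamp and then sleeps, and the reader reports
 * the gap between that timestamp and its own wakeup. The measured figure
 * therefore includes the write() call as well as the switch into the reader. */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <x86intrin.h>                  /* __rdtsc() */

static int pipefd[2];

static void *writer(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000; i++) {
        uint64_t t = __rdtsc();
        write(pipefd[1], &t, sizeof t); /* wakes the reader blocked in read() */
        usleep(1000);                   /* block, letting the reader run and block again */
    }
    close(pipefd[1]);                   /* lets the reader's loop terminate */
    return NULL;
}

int main(void)
{
    pipe(pipefd);

    pthread_t t;
    pthread_create(&t, NULL, writer, NULL);

    uint64_t sent;
    while (read(pipefd[0], &sent, sizeof sent) == sizeof sent) {
        uint64_t woke = __rdtsc();
        printf("write -> wakeup: %llu cycles\n",
               (unsigned long long)(woke - sent));
    }
    pthread_join(t, NULL);
    return 0;
}
```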

Note that in a well-written multiplexing server under heavy load, the transition from fd X to fd Y will often involve the same single system call (the server simply iterates over the list of active file descriptors returned by a single epoll_wait()). One thread also ought to have a smaller cache footprint than multiple threads, simply through having only one stack. I suspect the only way to settle the matter (for some definition of "settle") might be to have a benchmark shootout...
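
For concreteness, that loop looks roughly like the sketch below; handle_readable() is a placeholder for the real per-connection work, and the trivial main() just watches stdin so the example is self-contained.

```c
/* Sketch of the multiplexing inner loop: one epoll_wait() call returns all
 * currently-active descriptors, and the transition from one fd to the next
 * is just another iteration of the inner loop. */
#include <sys/epoll.h>

#define MAX_EVENTS 64

/* placeholder for whatever per-connection work the server does */
static void handle_readable(int fd) { (void)fd; }

static void event_loop(int epfd)
{
    struct epoll_event events[MAX_EVENTS];

    for (;;) {
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        if (n < 0)
            continue;                   /* e.g. interrupted by a signal */
        for (int i = 0; i < n; i++)
            if (events[i].events & EPOLLIN)
                handle_readable(events[i].data.fd);
    }
}

int main(void)
{
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = 0 };
    epoll_ctl(epfd, EPOLL_CTL_ADD, 0, &ev);   /* watch stdin as a stand-in */
    event_loop(epfd);
    return 0;
}
```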

caf
  • Unless your connections are long-lived, the cost of extra syscalls to add/remove fds to the epoll set is almost surely going to outweigh any benefit, so I'm rather dismissing it and assuming a traditional select/poll. I agree a benchmark shootout would be fun, but I was hoping to get some preliminary figures on whether I could expect to "win" before hacking together a performance-oriented threaded httpd or something. :-) – R.. GitHub STOP HELPING ICE May 11 '11 at 04:27
  • BTW your point about voluntary vs involuntary is very good. I might adjust the test to simply put a pipe between the threads and measure the time between write and read... – R.. GitHub STOP HELPING ICE May 11 '11 at 04:29
  • Wow, the latency dropped to ~4900 cycles with voluntary switches. In comparison, `read` from a pipe typically takes ~2000 cycles. I think it sounds plausible that a thread-per-connection server could win. – R.. GitHub STOP HELPING ICE May 11 '11 at 04:40
  • @R.: In the threaded case, aren't you going to have a similar overhead at connection setup/teardown time, as you call in to the kernel to wake up the thread that will be handling the new connection with `pthread_cond_signal()` or similar? – caf May 11 '11 at 05:54
  • Yeah but I want to compare the optimal threaded implementation against the optimal event-based implementation, not a sub-optimal event-based implementation. – R.. GitHub STOP HELPING ICE May 11 '11 at 06:01
  • By the way, I would just have all idle threads blocked on `accept`, in which case there's no need to wake one up; the kernel will do it. I'd probably throw in a semaphore (with slightly unconventional usage, essentially just as an atomic counter) to keep track of the number of threads blocked on `accept` and create a new one when there are none left, or exit if there are too many (a rough sketch of this appears after these comments). – R.. GitHub STOP HELPING ICE May 11 '11 at 06:08
  • @R.: Ahh yes, of course. Still, which one out of `poll()` or `epoll()` is optimal depends on the exact parameters of the problem - such as the typical ratio between active file descriptors and total file descriptors (and the longevity of the connections, as you've noted). – caf May 11 '11 at 06:13
  • My hope is that the threaded solution gives an optimal solution independent of the parameters like active:total ratio or connection rate/lifetime. :-) – R.. GitHub STOP HELPING ICE May 11 '11 at 06:18
  • Unfortunately, thread-based antics are only competitive in languages/runtimes that don't have significant abstraction over the most basic system primitives. – Matt Joiner Jan 22 '12 at 12:04
  • That's why this question is tagged C. :-) – R.. GitHub STOP HELPING ICE Jan 22 '12 at 17:10
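
For reference, here is a rough sketch of the accept-blocking design described in the comments above. It is only illustrative: serve_connection(), the port number and the spawn policy are placeholders, error handling is omitted, and the "exit if too many threads are idle" case is left out.

```c
/* Sketch of the thread-per-connection acceptor described in the comments:
 * idle workers block directly in accept(), the kernel wakes one per incoming
 * connection, and a semaphore is used purely as an atomic counter of idle
 * workers so a new one can be spawned when none are left. */
#include <netinet/in.h>
#include <pthread.h>
#include <semaphore.h>
#include <sys/socket.h>
#include <unistd.h>

static sem_t idle_workers;              /* counts workers not currently serving a connection */
static int   listen_fd;

/* placeholder for the real per-connection work */
static void serve_connection(int fd) { close(fd); }

static void *worker(void *arg);

static void spawn_worker(void)
{
    pthread_t t;
    sem_post(&idle_workers);            /* count the new worker as idle up front */
    pthread_create(&t, NULL, worker, NULL);
    pthread_detach(t);
}

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        int fd = accept(listen_fd, NULL, NULL);  /* kernel wakes one blocked worker */
        sem_wait(&idle_workers);        /* this worker is now busy */

        int idle;
        sem_getvalue(&idle_workers, &idle);
        if (idle == 0)
            spawn_worker();             /* make sure someone is still blocked in accept() */

        if (fd >= 0)
            serve_connection(fd);
        sem_post(&idle_workers);        /* idle again, back to accept() */
        /* (the "exit if too many are idle" case is omitted for brevity) */
    }
    return NULL;
}

int main(void)
{
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(8080),        /* placeholder port */
                                .sin_addr.s_addr = htonl(INADDR_ANY) };

    sem_init(&idle_workers, 0, 0);
    listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    bind(listen_fd, (struct sockaddr *)&addr, sizeof addr);
    listen(listen_fd, SOMAXCONN);

    spawn_worker();                     /* first acceptor; more are created on demand */
    pause();                            /* the workers do all the real work */
    return 0;
}
```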