
I am trying to measure the impact of the CPU scheduler on a large AI program (https://github.com/mozilla/DeepSpeech).

By using strace, I can see that it uses a large number of threads (about 200).
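(For reference, the thread count can also be checked non-invasively from `/proc`; `<pid>` below is a placeholder for the real process ID:)

```sh
# Count the threads of a running process; <pid> is a placeholder.
ls /proc/<pid>/task | wc -l
# or equivalently (NLWP = number of lightweight processes, i.e. threads):
ps -o nlwp= -p <pid>
```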

I have tried using Linux perf to measure this, but I have only been able to count context-switch events, not measure their overhead.
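What I have run so far is roughly the following (`./my_program` stands in for the actual DeepSpeech invocation):

```sh
# Count context-switch (and CPU-migration) events for one run of the workload.
perf stat -e context-switches,cpu-migrations ./my_program
```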

What I am trying to obtain is the total CPU core-seconds spent on context switching. Since it is a pretty large program, I would prefer non-invasive tools, to avoid having to edit its source code.

How can I do this?

  • Are you trying to measure total time in the kernel (i.e including time for kernel to perform IO) or just the time spent performing context switches? – Noah Mar 02 '21 at 23:07
  • @Noah Just the time performing context switches. My ultimate goal is this: I have a program that spends a lot of time in the kernel, and I want a detailed breakdown of this time (i.e. x percent spent doing syscalls, y percent spent on context-switches, etc.). I know that syscall time can be measured by `strace`. But I don't know how to measure context-switch time. – Azuresonance Mar 03 '21 at 02:52
  • Including time to context switch for `syscall` or just between threads? – Noah Mar 03 '21 at 17:09
  • Related: [FlexSC: Flexible System Call Scheduling with Exception-Less System Calls](https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Soares.pdf) is a paper that has some measurements/simulations of lowered IPC *after* a system-call returns. Even worse when migrating a thread to a new core, or switching between threads on this core, for cache locality. – Peter Cordes Apr 10 '22 at 10:38

1 Answer


Are you sure most of those 200 threads are actually waiting to run at the same time, not waiting for data from a system call? I guess you can tell from `perf stat` that context-switches are actually pretty high, but part of the question is whether they're high for the threads doing the critical work.
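One quick, non-invasive way to check is the per-thread split between voluntary and non-voluntary context switches, e.g. with `pidstat` from the sysstat package (a sketch; `<pid>` is a placeholder):

```sh
# Per-thread context-switch rates, sampled once per second:
#   cswch/s   = voluntary switches (thread blocked, e.g. waiting on I/O or a futex)
#   nvcswch/s = involuntary switches (thread was preempted while still runnable)
pidstat -w -t -p <pid> 1
```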

The cost of a context-switch is reflected in cache misses once a thread is running again (and in stopping OoO exec from finding as much ILP right at the interrupt boundary). This cost is more significant than the cost of the kernel code that saves/restores registers. So even if there were a way to measure how much time the CPUs spent in kernel context-switch code (possible with the `perf record` sampling profiler, as long as your `perf_event_paranoid` setting allows recording kernel addresses), that wouldn't be an accurate reflection of the true cost.
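For reference, that (under-)measurement might look roughly like the following, assuming `perf_event_paranoid` / `kptr_restrict` permit kernel sampling; the interesting symbols in the report are the scheduler functions such as `__schedule` and `finish_task_switch`:

```sh
# Sample kernel-mode cycles system-wide while the workload runs (or use -p <pid>),
# then look for scheduler symbols in the report.
perf record -a -g -e cycles:k -- sleep 10
perf report --sort symbol
```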

Even making a system call has a similar (but lower and more frequent) performance cost from serializing OoO exec, as well as disturbing caches (and TLB). There's a useful characterization of this on real modern CPUs (from 2010) in a paper by Soares & Stumm, especially the graph on the first page showing IPC (instructions per cycle) dropping after a system call returns and taking time to recover: FlexSC: Flexible System Call Scheduling with Exception-Less System Calls. (Conference presentation: https://www.usenix.org/conference/osdi10/flexsc-flexible-system-call-scheduling-exception-less-system-calls)


You might estimate context-switch cost by running the program on a system with enough cores not to need to context-switch much at all (e.g. a big many-core Xeon or Epyc), vs. on fewer cores but with the same CPUs / caches / inter-core latency and so on. So, on the same system, with `taskset --cpu-list 0-8 ./program` to limit how many cores it can use.

Look at the total user-space CPU-seconds used: the difference is the extra CPU time needed because of slowdowns from context switches. The wall-clock time will of course be higher when the same work has to compete for fewer cores, but `perf stat` includes a "task-clock" output which tells you the total time in CPU-milliseconds that threads of your process spent on CPUs. That would be constant for the same amount of work, with perfect scaling to more threads, and/or with the same threads competing for more / fewer cores.
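A minimal sketch of that comparison (the core lists and program name are placeholders; compare the `task-clock` and `context-switches` lines between the runs):

```sh
# Same machine, same binary; only the set of allowed cores changes.
perf stat -e context-switches,task-clock -- taskset --cpu-list 0-63 ./my_program
perf stat -e context-switches,task-clock -- taskset --cpu-list 0-7  ./my_program
```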

But that would tell you about context-switch overhead on that big system with big caches and higher latency between cores than on a small desktop.

Peter Cordes
  • You are right, the threads could be waiting for data and voluntarily yielding the CPU, causing a context-switch. Is there a way for me to differentiate this case from an involuntary context-switch caused by timer interrupts? I haven't been able to find a way to count timer interrupts... (see the `/proc`-based sketch after these comments) – Azuresonance Feb 22 '21 at 11:34
  • 1
    @Azuresonance: counting timer interrupts wouldn't be useful anyway, since unless you're running a tickless kernel, not every timer interrupt causes a context switch. You might get a rough idea by looking at load average (`uptime` or `top`) to see how many tasks on average are running, or would be running if they had a CPU available. But I think that also counts tasks in D state (disk sleep); IDK if that's a factor. – Peter Cordes Feb 22 '21 at 11:40
  • @Azuresonance: I don't know how to count non-voluntary context switches. Like maybe you could take the `perf stat` count and subtract the number of `yield()` or `futex()` system calls? And of course sleeping because you need to wait for a lock might mean that some other thread went to sleep while holding a lock, or else it's just taking a long time inside its critical section. Sleeping while holding a lock would amplify the number of context switches caused by the code trying to use more cores than you have. (Most good multi-threaded code chooses a number of threads based on available cores) – Peter Cordes Feb 22 '21 at 11:43
  • Similar to your perf record method, you might be able to isolate the start / end with [PERF_SAMPLE_ADDR](https://man7.org/linux/man-pages/man2/perf_event_open.2.html). I'm not sure how you get the address of context switch code. – Noah Feb 22 '21 at 19:07
  • @Noah: Yes, `perf record` uses `PERF_SAMPLE_ADDR` if `perf_event_paranoid` allows it. Linux's context-switch code is `switch_to()`, called by `schedule()` [Context switch internals](https://stackoverflow.com/q/12630214). (The actual user-space register save/restore is in kernel entry/exit, though, except for SIMD/FP registers which are eagerly saved/restored on context switch in modern Linux, at least for x86. So `switch_to()` just has to change current integer registers, including kernel stack pointer, and save/restore FP state. And CR3 page table if new process, not just thread) – Peter Cordes Feb 23 '21 at 00:10
  • @PeterCordes So I tried your advice of `perf stat`ing the number of `yield`s, and got a very weird result that doesn't make any sense. I ran `perf stat -e 'syscalls:sys_enter_sched_yield' -e 'context-switches' ./my_program` and got 16 million `yield`s per 3 million context-switches. How could there be more `yield`s than context-switches? I would be grateful if you have any idea. – Azuresonance Feb 24 '21 at 07:51
  • @Azuresonance: If there aren't any other tasks ready to run (which aren't already running on another core), yield will return to the same task that called it without actually triggering a context-switch. Or for whatever other reason the scheduler decides to keep running this task. – Peter Cordes Feb 24 '21 at 07:58
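A `/proc`-based sketch for the voluntary vs. involuntary question above (`<pid>` is a placeholder; the counters are cumulative per thread since it started):

```sh
# Sum voluntary and non-voluntary context switches over all threads of <pid>.
awk '/^voluntary_ctxt_switches/    {v += $2}
     /^nonvoluntary_ctxt_switches/ {n += $2}
     END {print "voluntary:", v, "involuntary:", n}' /proc/<pid>/task/*/status
```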