12

I'm trying to profile a multithreaded program I've written on a somewhat large machine (32 cores, 256GB RAM). I've noticed that between runs, the performance of the program can vary drastically (70-80%). I can't seem to find the cause of this giant variance in the program's performance, but by analyzing the output of the `time` utility over a large number of runs, I've noticed that the number of involuntary context switches correlates highly with program performance (obviously, fewer context switches lead to better performance and vice versa).

Is there any good way to determine what's causing this context switching? If I can discover the culprit, then maybe I can try to fix the problem. I have a few particular restrictions on tools I can use, however. First, I don't have root privileges on the machine, so any tools requiring such privileges are out. Second, it's a fairly old kernel (RHEL5, kernel 2.6.18), so some of the standard perf-event stuff may not be present. Anyway, any suggestions on how to dig deeper into the cause of this context switching would be greatly appreciated.

Update: I decided to test my program on a different (and smaller) machine. The other machine is a 4-core (with hyperthreading) Linux box with 8GB of RAM and a much newer kernel --- 3.2.0 vs. 2.6.18 on the other machine. On the new machine, I'm unable to reproduce the bi-modal performance profile. This leads me to believe that the issue is either due to a hardware problem (as was suggested in the comments) or to a particularly pathological case at the kernel level that has since been fixed. My current best hypothesis is that it may be a result of the fact that the new machine has a kernel with the Completely Fair Scheduler (CFS) while the old machine does not. Is there a way to test this hypothesis (to tell the new machine to use a different / older scheduler) without having to recompile an ancient kernel version for the new machine?

nomad
  • By "involuntary context switches" do you mean "some other process wanted to run", or "MY process did something that caused the system to stop it while the system completed some work, e.g waiting for some file-data to be loaded from disk or network"? – Mats Petersson Jun 23 '13 at 23:39
  • Are you aware of pthread_cond_t? – Varvarigos Emmanouil Jun 23 '13 at 23:39
  • @MatsPetersson yeah - 'First, I don't have root privileges on the machine' does not suggest exclusive use of it. – Martin James Jun 23 '13 at 23:44
  • 3
    Ask the system admin, who does have privilege, to find out. – Martin James Jun 23 '13 at 23:45
  • Just because you are not alone doesn't necessarily mean there isn't enough resource on the machine to run your task. But if other people are sharing the machine, then it's tough - unless of course, you can ask the admin staff to give you some right to raise the priority of your tasks, above other people's tasks, or something else of that sort. If there are competing tasks on a machine, the machine will share the CPU resources between the tasks - that's how a multiuser system works... – Mats Petersson Jun 23 '13 at 23:49
  • While I don't have admin privileges, I am "alone" on the machine (at least I was when I did my previous profiling). The other users with access to the machine are members of our fairly small research group, so if I need some time to run a performance analysis, I can ask them for solo use of the machine for some time. The fact that this happens when I'm alone on the machine suggests that maybe there's an intermittent system process that's causing the interrupts, but I don't know how to check for certain. – nomad Jun 24 '13 at 00:27
  • By "involuntary context switches" I mean that, according to the OS, it's not the result of my process yielding or waiting, but rather the result of the OS forcibly pre-empting my process for some other purpose. – nomad Jun 24 '13 at 00:28
  • How did you find out that you are not yielding nor waiting? Does your program allocate memory, perform system calls, or use any library that perform those? Is your program memory-I/O-bound? – DanielKO Jun 24 '13 at 01:25
  • Have you tried having a squiz in `top` while your program's running? Whatever's preempting you is probably using a lot of CPU.... – Tony Delroy Jun 24 '13 at 01:26
  • 2
    @DanielKO -- The `time` utility actually breaks down context switches by voluntary / involuntary. The definition of involuntary context switches here are when your process is pre-empted by the OS for some reason other than it voluntarily giving up control (e.g. yielding / waiting). This can happen when its time-slice expires and there is a higher-priority process to be run, and presumably under a number of other conditions as well. – nomad Jun 24 '13 at 01:40
  • Oh, I forgot GNU time has those extra things. How's the number of minor page faults? Also, the other things I asked. – DanielKO Jun 24 '13 at 01:50
  • My guess - the other users have scheduled big cron jobs. – Martin James Jun 24 '13 at 01:50
  • @DanielKO -- The program doesn't allocate memory dynamically, but allocates a fixed quantity of memory up front. It reads in a file from disk and counts the # of occurrences of fixed subwords that occur in the file, so there is substantial I/O being performed. However, I've tested the I/O portion of the program in isolation and that's not the bottleneck (I can read more subwords than I can process). Also, the performance appears uniform when I'm just doing I/O (i.e. I don't see a large number of involuntary context switches resulting in degraded performance). – nomad Jun 24 '13 at 02:00
  • @TonyD -- Yea; I've watched 'top' while the program's running and the strange thing is that it doesn't look like there's a whole lot else going on in the system; mostly just system-level stuff (e.g. dbus-daemon, greceptor (it's running ROCKS cluster software) and some other various low-load things). – nomad Jun 24 '13 at 02:12
  • @DanielKO -- The program reports ~3920000 Minor page faults per run (the variance here is tiny) and 0 Major page faults per run. The number of page faults seems independent of whether I get "good" or "bad" performance. – nomad Jun 24 '13 at 02:40
  • Doing any atomic operation? That is not cheap with many CPUs. Otherwise, this problem is still too abstract. You should transform your program into a minimalist sample code and post it. You'll either find the problem yourself, or people will find a fundamental flaw with the algorithm design. – DanielKO Jun 24 '13 at 04:00
  • @DanielKO -- Actually, yes, I am doing atomic operations. The counts for subwords are stored in a large array of atomic integers (std::atomic). I don't understand why that would cause a performance problem sometimes but not others. The performance profile is actually strange in that there are two "regimes"; fast and slow, and every execution of the program falls into one of these (i.e. it never performs at a level between the fast and slow execution times). Is there a way I can dig more deeply into the atomics to see if they're causing a problem? – nomad Jun 24 '13 at 04:11
  • First thing to watch out for is false sharing. I can't really elaborate too much in this space, but check out http://stackoverflow.com/questions/10143676/false-sharing-and-atomic-variables and chapter 4 of the perfbook https://www.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html . Also, it could be a hardware problem; I've worked once with a 264 core SGI UV machine that from time to time would slow down due to a hardware failure, much like your "two regimes" scenario. – DanielKO Jun 24 '13 at 04:27
  • Interestingly, I see a very similar performance profile if I remove the actual atomic writes from the code; so the program just reads in the file and performs some processing on the subwords (e.g. computing a hash function). Very strange. – nomad Jun 24 '13 at 04:59
  • Too bad you're concerned about Linux and not Solaris/Mac OS/FreeBSD - DTrace would make short work of an investigation like this. See this blog post series for details: [pt1](http://blog.bignerdranch.com/1907-hooked-on-dtrace-part-1/), [pt2](http://blog.bignerdranch.com/1968-hooked-on-dtrace-part-2/), [pt3](http://blog.bignerdranch.com/2031-hooked-on-dtrace-part-3/), [pt4](http://blog.bignerdranch.com/2150-hooked-on-dtrace-part-4/) – Dan Jun 30 '13 at 22:40
  • Thanks @gerty3000! I'm actually concerned about performance on Mac OS as well, but I just don't have access to as many machines to test on that platform. Actually, it would be interesting to see if this problem even pops up on OS X, as I assume the scheduler is completely different and, if that's the source of the trouble, there may be no trouble at all on OS X. – nomad Jul 01 '13 at 00:25
  • How frequently is it being run? Is it possible that sometimes the file is cached and sometimes it is not? Are you hand-reading the file into memory or using mmap? – kfsone Jul 01 '13 at 05:40
  • The variance I reported is over many back-to-back runs. However, the performance alternates between the "fast" and "slow" regimes. The file is parsed by a reader thread that fills up a double-ended concurrent queue, and data is then pulled from the queue by other threads for processing. However, I don't think it has to do with file caching, as the performance regimes seem to disappear when I do no processing in the data-processing threads (i.e. when I just read the file). – nomad Jul 01 '13 at 06:14

6 Answers

8

You mentioned there are 32 cores, but what is the exact layout of the hardware? E.g. how many packages does the machine have, how many cores per package, how is the cache shared, etc. For sharing this kind of information I personally like the output of `likwid-topology -g`.

Anyway, there is one source of non-determinism in your runs: thread affinity. The operating system assigns the SW threads to specific HW threads without taking into account how the threads communicate (simply because it doesn't have that knowledge). That can cause all kinds of effects, so for reproducible runs it's a good idea to pin your SW threads to HW threads in some way (there may be an optimal pinning, too, but so far I am just talking about determinism).

For pinning (a.k.a. affinity) you can either use explicit Pthread calls or you might try another tool from the Likwid suite called likwid-pin - see here.

If that doesn't get you consistent results, run a good profiler (e.g. Intel VTune) on your workload making sure you capture a faster run and a slower run and then compare the results. In VTune you can use the compare feature that shows two profiles side by side.

Alexey Alexandrov
  • Thanks for the suggestion. I actually tried to set the affinity of my threads directly using pthreads (and it works), but I still see the performance variability. This leads me to believe that the issue is most likely a result of the old kernel / scheduler. However, I didn't know about likwid tools and they seem very useful. Further, I've convinced the admin to create a sudoers group so that I can actually use these profiling tools. I think that suggestion alone may be worth the bounty. – nomad Jul 03 '13 at 15:10
  • Thanks for the bounty! Interesting that the affinity didn't improve the reproducibility. What I would try to see next (despite using a profiler) is whether the variability comes from user time or kernel time. The mere 'time' utility should suffice for that. As a deeper dive into scheduling issues I would do VTune with Advanced Hotspots with stacks on (that collects context switch information) or ftrace's scheduling tracing capabilities. One important thing I should have asked is how long your workload takes. You said 80% variability - but is it 5 seconds vs. 9 seconds or 5 ms vs. 9 ms? – Alexey Alexandrov Jul 03 '13 at 18:31
  • The variability is over a fairly large time-scale (e.g. 120 vs 200 seconds). If it only happened at very high resolutions (e.g. 5ms vs 9ms) I probably wouldn't be so worried about it, and wouldn't be sure it wasn't just measurement error. – nomad Jul 03 '13 at 21:20
  • 120 vs 200 seconds is really interesting. Do you have any dependency on the network bandwidth in your workload? That can go as tricky as reading some big amount of data from a network share. Depending on the network load you could get different speed. But that wouldn't give you a bi-modal result. Anyway, when you find the ultimate reason it would be very useful to know what it was - keep us posted if you get a chance! :) – Alexey Alexandrov Jul 04 '13 at 05:21
3

I believe your problem is actually a scheduling issue.

One cannot prevent one's process from being pre-empted from the CPU. The problem is that if a thread is pre-empted and then on its next quantum ends up on a different CPU, or more specifically on a CPU with a different L2 cache, then its accesses to memory will all yield cache misses and cause the data to be fetched from memory again. On the other hand, if the thread gets scheduled on the same CPU, it is likely that its data will still be available in the cache, yielding much faster memory accesses.

Note that this behavior is more likely to happen the more cores you have. And since it is sort of "random" where your thread will end up on its next quantum, this would explain the randomness of the performance.

There are profiling tools out there that allow you to record where your threads are being scheduled, such as perf for Linux. Usually these tools are particular to the topology on which you're executing your program; unfortunately no others come to my mind right now. There are also ways of telling the OS to schedule threads on the same (or adjacent) CPUs, so they'll benefit from fewer cache misses. For that you can check this SO question

I would suggest that you ask your admin which of these tools are available to you, so you can do a proper profiling and assignment of your thread scheduling.

cgledezma
1

You can trace the source of context switches in the kernel by using ftrace (using the sched_switch tracer). Also, using perf may help you narrow in on other metrics (cache misses, etc.).

How does the performance variability change as you increase your program from 1 thread up to 32 (and possible beyond that)?

Could it be that your program has shared data (or some other resource) that the individual threads are contending for, such that the contention grows when there is more parallel execution (hence the variable performance)? Besides CPU time, what resources do your program's threads contend for?

I think you're going to have to compile the old kernel on your 4 core machine. Doing so would not only be a good educational exercise, if you haven't already done it, but would be valuable in isolating variables in this problem. If you were to do that, I'd also try to match the exact distro, as your problem may not only be limited to the kernel. It could also be something in the user space or perhaps simply in the compilation options for the kernel (that would therefore be matched in the same distro release).

You may want to look into using a profiler, such as gprof. It may give you insight into the (potential) bottleneck. For example, if you're waiting on a write system call or something of that nature.

Here are a few articles that may help trigger an idea or two:

http://halobates.de/lk09-scalability.pdf

http://pdos.csail.mit.edu/papers/linux:osdi10.pdf

Additional source: Why one non-voluntary context switch per second?

Homer6
  • Lastly, you could always petition for a kernel upgrade. :-) – Homer6 Jun 28 '13 at 05:12
  • Thanks for the links. I think that you're right that I may, indeed have to end up re-compiling the old kernel for the other machine. – nomad Jul 03 '13 at 15:08
  • 1
    Also, these video lectures have a lot of good tools and techniques that may be useful in your case: http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-172-performance-engineering-of-software-systems-fall-2010/ – Homer6 Jul 03 '13 at 21:49
  • @nomad I've updated this to reference the ftrace tool which will give you further insight into context switches. – Homer6 Jul 13 '13 at 06:12
1

You mention a bi-modal performance profile which you see on one machine and not on the other. It is horrible, but this is normal, even for single-threaded applications.

The problem is that there are far too many factors in a Linux system (any kernel, regardless of the scheduler used) which affect the application performance. It starts at address randomisation, and ends with microscopic timing differences escalating to huge context switch delays between processes.

Linux is not a real-time system. It just tries to be as efficient as possible on the average case.

There are a number of things you can do to minimize the performance variance:

Reduce the number of threads to the bare minimum necessary. Do not split different aspects of your problem up into threads; only split into threads when really necessary, for example to feed CPUs with independent (!) number-crunching jobs. Try to do as much causally connected work in one thread as possible. Your threads should require as little communication with each other as possible. In particular, you should have no request/response patterns between threads where the latencies add up.

Assume your OS is only able to do about 1000 context switches between threads/processes per second. That means only a couple of hundred request/response transactions per second. If you do a benchmark on Linux and find you can do much more, ignore this.

Try to reduce the memory footprint of vital data. Scattered data tends to thrash the cache, with very subtle and hard-to-explain effects on performance.

Johannes Overmann
0

In a multithreaded program, attach each thread to a particular CPU number and check whether you see an improvement in performance.

Anand Rathi
0

Your best bet when it comes to rescheduling is to use the kernel profiler/tracer ftrace. This should show you which threads are being pre-empted by other threads. Unfortunately, bottom-half interrupt handlers are not properly labelled, so it can be difficult to decipher these.

doron