I'm trying to profile a multithreaded program I've written on a fairly large machine (32 cores, 256 GB of RAM). I've noticed that the program's performance varies drastically between runs (by 70-80%), with run times falling into what looks like a bimodal distribution. I can't find the cause of this huge variance, but by analyzing the output of the `time` utility over a large number of runs, I've noticed that the number of involuntary context switches correlates strongly with performance (obviously, fewer context switches lead to better performance and vice versa).
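For reference, here's roughly the harness I use to gather the per-run statistics (a minimal C sketch; `./myprog` is a placeholder for my actual binary). It records wall time together with `ru_nivcsw`, the same involuntary-context-switch counter that `time` reports, using the per-child rusage returned by `wait4()`:

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

#define N_RUNS 100

int main(void)
{
    int i;
    for (i = 0; i < N_RUNS; i++) {
        struct timeval t0, t1;
        struct rusage ru;
        int status;
        pid_t pid;
        double wall;

        gettimeofday(&t0, NULL);
        pid = fork();
        if (pid == 0) {
            execl("./myprog", "./myprog", (char *)NULL);
            _exit(127);  /* exec failed */
        }
        wait4(pid, &status, 0, &ru);  /* rusage for this child only */
        gettimeofday(&t1, NULL);

        wall = (t1.tv_sec - t0.tv_sec)
             + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("%8.3f s  %10ld involuntary switches\n",
               wall, ru.ru_nivcsw);
    }
    return 0;
}
```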
Is there any good way to determine what's causing this context switching? If I can discover the culprit, then maybe I can fix the problem. However, I have a couple of restrictions on the tools I can use. First, I don't have root privileges on the machine, so any tool that requires them is out. Second, it's a fairly old kernel (RHEL5, kernel 2.6.18), so the standard perf-events infrastructure (which wasn't merged until 2.6.31) isn't available. Anyway, any suggestions on how to dig deeper into the cause of this context switching would be greatly appreciated.
Update: I decided to test my program on a different (and smaller) machine: a 4-core (with hyperthreading) Linux box with 8 GB of RAM and a much newer kernel (3.2.0 vs. 2.6.18 on the original machine). On the new machine, I'm unable to reproduce the bimodal performance profile. This leads me to believe that the problem is either a hardware issue (as was suggested in the comments) or a particularly pathological case at the kernel level that has since been fixed. My current best hypothesis is that it stems from the new machine's kernel using the Completely Fair Scheduler (CFS), while the old 2.6.18 kernel predates CFS (merged in 2.6.23) and still uses the O(1) scheduler. Is there a way to test this hypothesis (i.e., to tell the new machine to use a different/older scheduler) without having to compile and run an ancient kernel version on the new machine?
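If swapping in the pre-CFS O(1) scheduler isn't possible without a rebuild, the closest experiment I can think of is changing the scheduling policy per process: if I'm reading sched_setscheduler(2) correctly, SCHED_BATCH tells CFS to treat the process as CPU-bound and disfavors it in wakeup-preemption decisions, and it can be set without root. A sketch of such a wrapper (`batchrun` and `./myprog` are placeholder names):

```c
/* batchrun: re-exec a command under SCHED_BATCH so CFS treats it as
 * CPU-bound and preempts it less aggressively on wakeups. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    struct sched_param sp = { 0 };  /* priority must be 0 for SCHED_BATCH */

    if (argc < 2) {
        fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
        return 1;
    }
    if (sched_setscheduler(0, SCHED_BATCH, &sp) != 0) {
        perror("sched_setscheduler");
        return 1;
    }
    execvp(argv[1], &argv[1]);  /* the policy is inherited across exec */
    perror("execvp");
    return 1;
}
```

Would comparing involuntary context switch counts between plain runs and `./batchrun ./myprog` runs be a meaningful way to isolate the scheduler's preemption decisions?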