My Python application running on a 64-core Linux box normally runs without a problem. Then after some random length of time (around 0.5 to 1.5 days usually) I suddenly start getting frequent pauses/lockups of over 10 seconds! During these lockups the system CPU time (i.e. time in the kernel) can be over 90% (yes: 90% of all 64 cores, not of just one CPU).
My app is restarted often throughout the day. Restarting the app does not fix the problem. However, rebooting the machine does.
Question 1: What could cause 90% system CPU time for 10 seconds? All of the system CPU time is in my parent Python process, not in the child processes created through Python's multiprocessing or in other processes. That means something on the order of 60+ threads spending 10+ seconds in the kernel. I am not even sure if this is a Python issue or a Linux kernel issue.
Question 2: The fact that a reboot fixes the problem must be a big clue as to the cause. What Linux resource could my app be exhausting that survives a restart of the app but not a reboot of the machine, and could cause this problem?
What I've tried so far to solve this / figure it out
Below I will mention multiprocessing a lot. That's because the application runs in a cycle and multiprocessing is only used in one part of the cycle. The high CPU almost always happens immediately after all the multiprocessing calls finish. I'm not sure if this is a hint at the cause or a red herring.
- My app runs a thread that uses `psutil` to log the process and system CPU stats every 0.5 seconds. I have independently confirmed what it's reporting with `top`. (A stripped-down version of this logger is sketched after this list.)
- I've converted my app from Python 2.7 to Python 3.4, because Python 3.2 got a new GIL implementation and 3.4 had multiprocessing rewritten. While this improved things it did not solve the problem (see my previous SO question, which I'm leaving up because it's still a useful answer, if not the total answer).
- I have replaced the OS. Originally it was Ubuntu 12 LTS, now it's CentOS 7. No difference.
- It turns out multithreading and multiprocessing clash in Python/Linux and are not recommended together, so Python 3.4 added the `forkserver` and `spawn` multiprocessing contexts. I've tried them; no difference.
- I've checked `/dev/shm` to see if I'm running out of shared memory (which Python 3.4 uses to manage multiprocessing): nothing. (The check I run is also sketched after this list.)
- `lsof` output listing all resources here
- It's difficult to test on other machines because I run a multiprocessing `Pool` of 59 children and I don't have any other 64-core machines just lying around.
- I can't run it using threads rather than processes because it just can't run fast enough due to the GIL (which is why I switched to multiprocessing in the first place).
- I've tried using `strace` on just one thread that is running slow (it can't run across all threads because it slows the app down far too much). Below is what I got, and it doesn't tell me much. `ltrace` does not work because you can't use `-p` with a thread ID, and even running `ltrace` on just the main thread (no `-f`) makes the app so slow that the problem doesn't show up.
- The problem is not related to load. The app will sometimes run fine at full load, and then later, at half load, it'll suddenly hit this problem.
- Even if I reboot the machine nightly the problem comes back every couple of days.
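For reference, a stripped-down version of the CPU-stats logger mentioned above (the names here are simplified from my real code, which also writes to a log file rather than printing):

```python
import threading

import psutil

def log_cpu_stats(interval=0.5):
    # psutil.Process() with no argument is the current (parent) process
    proc = psutil.Process()
    while True:
        # cpu_times_percent() blocks for `interval` seconds and returns
        # system-wide user/system/idle percentages averaged over all cores
        sys_pct = psutil.cpu_times_percent(interval=interval)
        proc_times = proc.cpu_times()
        print("system: user=%5.1f%% sys=%5.1f%% | process: user=%.1fs sys=%.1fs"
              % (sys_pct.user, sys_pct.system, proc_times.user, proc_times.system))

threading.Thread(target=log_cpu_stats, daemon=True).start()
```

And this is roughly the `/dev/shm` check; the `sem.` filename prefix is how POSIX named semaphores (which multiprocessing uses on Linux) show up on my system:

```python
import os
import shutil

# multiprocessing's POSIX semaphores live in /dev/shm as "sem.*" files;
# anything left over after the app has exited would point to a leak.
leftovers = [name for name in os.listdir("/dev/shm") if name.startswith("sem.")]
print("possible leaked semaphores:", leftovers)

usage = shutil.disk_usage("/dev/shm")
print("/dev/shm free: %d MiB of %d MiB" % (usage.free // 2**20, usage.total // 2**20))
```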
Environment / notes:
- Python 3.4.3 compiled from source
- CentOS 7, totally up to date. `uname -a`: Linux 3.10.0-229.4.2.el7.x86_64 #1 SMP Wed May 13 10:06:09 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux (although this kernel update was only applied today)
- Machine has 128GB of memory and has plenty free
- I use numpy linked to ATLAS. I'm aware that OpenBLAS clashes with Python multiprocessing but ATLAS does not, and that clash is in any case solved by Python 3.4's `forkserver` and `spawn` contexts, which I've tried.
- I use OpenCV, which also does a lot of parallel work
- I use `ctypes` to access a C .so library provided by a camera manufacturer
- The app runs as root (a requirement of a C library I link to)
- The Python multiprocessing `Pool` is created in code guarded by `if __name__ == "__main__":` and in the main thread (a simplified sketch of this setup follows this list)
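A simplified sketch of that `Pool` setup; `do_work` and the job list are placeholders for the real code:

```python
import multiprocessing

def do_work(item):
    return item * item  # stand-in for the real per-item work

if __name__ == "__main__":
    # "forkserver" avoids fork()ing from a process that already has threads
    # running; I've also tried "spawn". The Pool is created in the main thread.
    ctx = multiprocessing.get_context("forkserver")
    jobs = range(1000)  # stand-in for the real work items
    with ctx.Pool(processes=59) as pool:
        results = pool.map(do_work, jobs)
```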
Updated strace results
A few times I've managed to strace a thread that ran at 100% 'system' CPU, but only once have I gotten anything meaningful out of it. See below the call at 10:24:12.446614 that takes 1.4 seconds. Given that it's the same futex address (0x7f05e4d1072c) you see in most of the other calls, my guess is that this is Python's GIL synchronisation. Does this guess make sense? If so, then the question is why the wait takes 1.4 seconds. Is something not releasing the GIL? (One experiment based on this guess is sketched after the trace.)
10:24:12.375456 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.000823>
10:24:12.377076 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.002419>
10:24:12.379588 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.001898>
10:24:12.382324 sched_yield() = 0 <0.000186>
10:24:12.382596 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.004023>
10:24:12.387029 sched_yield() = 0 <0.000175>
10:24:12.387279 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.054431>
10:24:12.442018 sched_yield() = 0 <0.000050>
10:24:12.442157 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.003902>
10:24:12.446168 futex(0x7f05e4d1022c, FUTEX_WAKE, 1) = 1 <0.000052>
10:24:12.446316 futex(0x7f05e4d11cac, FUTEX_WAKE, 1) = 1 <0.000056>
10:24:12.446614 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <1.439739>
10:24:13.886513 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.002381>
10:24:13.889079 sched_yield() = 0 <0.000016>
10:24:13.889135 sched_yield() = 0 <0.000049>
10:24:13.889244 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.032761>
10:24:13.922147 sched_yield() = 0 <0.000020>
10:24:13.922285 sched_yield() = 0 <0.000104>
10:24:13.923628 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.002320>
10:24:13.926090 sched_yield() = 0 <0.000018>
10:24:13.926244 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.000265>
10:24:13.926667 sched_yield() = 0 <0.000027>
10:24:13.926775 sched_yield() = 0 <0.000042>
10:24:13.926964 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable) <0.000117>
10:24:13.927241 futex(0x7f05e4d110ac, FUTEX_WAKE, 1) = 1 <0.000099>
10:24:13.927455 futex(0x7f05e4d11d2c, FUTEX_WAKE, 1) = 1 <0.000186>
10:24:13.931318 futex(0x7f05e4d1072c, FUTEX_WAIT, 2, NULL) = 0 <0.000678>
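If that futex really is the GIL, then with 60+ threads the default 5 ms switch interval means a lot of kernel-level contention on a single futex. A cheap experiment based on that assumption (not a confirmed diagnosis or fix) would be to raise the switch interval early in startup and see whether the lockups change character:

```python
import sys

# The default GIL switch interval is 5 ms. With ~60 threads all waiting on
# the same futex, a longer interval should reduce kernel-side contention.
# This only tests the "that futex is the GIL" guess; it is not a fix.
print("current switch interval:", sys.getswitchinterval())  # 0.005 by default
sys.setswitchinterval(0.05)  # try 50 ms and watch whether the lockups change
```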