I am running a Python (`python3`) script that spawns lots of processes (20-30 of them) at the same time through `multiprocessing.Process`, using `fork` and not `spawn`. I make sure all of these processes finish (`.join()`) and don't become zombies. However, despite running the same code with the same random seed, my job crashes due to a huge spike in memory usage at completely random times (memory usage suddenly jumps to a random value between 30 GB and 200 GB from the requested 14 GB). Sometimes my job/script crashes 10 minutes after starting, sometimes right at the beginning, and sometimes 10 hours in. Note that this process is deterministic and I can repeat it, but I cannot reproduce the crash, which is very weird.

What each of those processes does is load an image from disk using `cv2.imread` (each might take about 0.5 MB in memory) and store it into shared memory (`mp.RawArray('f', 3*224*224)` or `mp.Array('f', 3*224*224)`) that I created before starting the process. My code creates and processes around 1500-2000 of these images every minute on the server I'm running it on. It's very annoying to see that sometimes only 100-150 of those images have been read from disk before the job crashes right at the beginning, even though I'm requesting 25 GB of memory when I submit the job to our servers, which run CentOS.
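For reference, here is roughly what the setup looks like (a simplified sketch, not my exact code: the file names, worker count, resize, and normalization below are placeholders):

```python
# Minimal sketch of the described pipeline: one pre-allocated shared buffer per
# image, filled by a forked worker that reads the image with cv2.imread.
import multiprocessing as mp
import numpy as np
import cv2

IMG_SIZE = 3 * 224 * 224           # floats per image buffer
N_WORKERS = 30                     # hypothetical; "20-30 of them" in practice

def load_into_shared(path, shared_arr):
    # Runs in the forked child: read the image, resize, copy into shared memory.
    img = cv2.imread(path)                         # ~0.5 MB each; None if missing
    if img is None:
        return
    img = cv2.resize(img, (224, 224)).transpose(2, 0, 1).astype(np.float32)
    np.frombuffer(shared_arr, dtype=np.float32)[:] = img.ravel() / 255.0

if __name__ == "__main__":
    mp.set_start_method("fork")                    # fork, not spawn
    paths = [f"img_{i}.jpg" for i in range(N_WORKERS)]          # hypothetical paths
    buffers = [mp.RawArray('f', IMG_SIZE) for _ in paths]       # created before starting
    procs = [mp.Process(target=load_into_shared, args=(p, b))
             for p, b in zip(paths, buffers)]
    for p in procs:
        p.start()                                  # all started at the same time
    for p in procs:
        p.join()                                   # make sure none become zombies
```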
I've tried increasing the requested memory (from 25 GB to 115 GB) on our servers, but my script still crashes sooner or later, at completely random times. Another thing I noticed is that although I spawn lots of processes and call `.start()` on them at the same time, most of those processes do not actually start running until the ones spawned earlier have finished. This is because I do not request a lot of cores (e.g. 30) when submitting my job and only use 8 cores.
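Just to make that concurrency limit concrete, something like the following batching (an illustration only, not what my script currently does) would explicitly cap how many workers run at once at the 8 cores I request:

```python
# Sketch: start workers in batches of the core count, so at most N_CORES
# processes are running at any time; later ones wait for earlier ones to finish.
import multiprocessing as mp

N_CORES = 8    # cores actually requested for the job

def run_in_batches(procs, batch_size=N_CORES):
    for i in range(0, len(procs), batch_size):
        batch = procs[i:i + batch_size]
        for p in batch:
            p.start()
        for p in batch:
            p.join()   # earlier workers finish before the next batch starts
```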
I wonder if people have had similar experiences. I would appreciate your comments/suggestions.