
I am running a Python 3 script that spawns (using fork, not spawn) lots of processes through `multiprocessing.Process` (e.g. 20-30 of them) at the same time. I make sure all of these processes finish (`.join()`) and don't become zombies. However, even though I run the same code with the same random seed, my job crashes due to a huge spike in memory usage at completely random times (memory usage suddenly jumps from the requested 14GB to anywhere between 30GB and 200GB). Sometimes my script crashes 10 minutes after starting, sometimes right at the beginning, and sometimes 10 hours into the run. Note that the process is deterministic and I can repeat it, yet I cannot reproduce the crash, which is very weird.

What each of those processes does is load an image from disk using `cv2.imread` (each takes about 0.5MB in memory) and store it into shared memory (`mp.RawArray('f', 3*224*224)` or `mp.Array('f', 3*224*224)`) that I created before starting the process. My code creates and processes around 1500-2000 of these images per minute on the server I'm running it on. It's very annoying to see the job crash at the very beginning, when only 100-150 of those images have been read from disk, even though I request 25GB of memory when I submit the job to our servers, which run CentOS.
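To make the setup concrete, here is a minimal sketch of what I described above. It is not my actual code: `load_into_shared`, `main`, and the image paths are made-up names, and a random array stands in for `cv2.imread` so the sketch runs without OpenCV or files on disk.

```python
import multiprocessing as mp
import numpy as np

C, H, W = 3, 224, 224  # channels, height, width of each image

def load_into_shared(shared, path):
    """Child process: 'load' one image and copy it into shared memory."""
    # The real script calls cv2.imread(path); a random array stands in
    # here so the sketch is self-contained.
    img = np.random.rand(C, H, W).astype(np.float32)
    # View the locked shared buffer as a numpy array and copy pixels in.
    dst = np.frombuffer(shared.get_obj(), dtype=np.float32)
    dst[:] = img.ravel()

def main(n_images=4):
    # One pre-allocated shared float buffer per in-flight image,
    # created before the processes are started.
    buffers = [mp.Array('f', C * H * W) for _ in range(n_images)]
    procs = [mp.Process(target=load_into_shared, args=(buf, f"img_{i}.png"))
             for i, buf in enumerate(buffers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()  # wait for every child so none is left as a zombie
    return all(p.exitcode == 0 for p in procs)

if __name__ == "__main__":
    print(main())
```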

I've tried increasing the requested memory (from 25GB to 115GB) on our servers, but my script still crashes sooner or later, at completely random times. Another thing I noticed is that although I spawn lots of processes and call `.start()` on all of them at the same time, most of them do not start running until the ones spawned earlier are done. This is because I do not request a large number of cores (e.g. 30) when submitting my job; I use 8 cores.
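The scheduling behavior described above (20-30 started processes serialized onto 8 cores) can also be made explicit with a worker pool instead of bare `Process` objects. This is only a sketch under assumed names (`work`, `main`), not the original code:

```python
import multiprocessing as mp

def work(i):
    # Stand-in for the per-image load/preprocess step (hypothetical).
    return i * i

def main(n_images=30, n_workers=8):
    # Cap live workers at the number of cores the job actually has
    # (8 here), instead of start()-ing 20-30 Processes at once and
    # letting the OS scheduler serialize them.
    with mp.Pool(processes=n_workers) as pool:
        return pool.map(work, range(n_images))

if __name__ == "__main__":
    print(main()[:5])  # -> [0, 1, 4, 9, 16]
```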

I wonder if people have had similar experiences. I would appreciate your comments/suggestions.

Amir
  • What you've given here is just the tip of the iceberg, but if the arrays are the reason for the crash, it could be because of the dtype of the arrays and/or the number of arrays/items you (can) have in memory at a particular time. That said, you can calculate that by hand: multiply the number of arrays by their dimensions by the memory required for the dtype, and see whether your RAM satisfies that amount. If that isn't feasible, you can either change/manipulate the dtypes or just use [`mmap()`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html) – Mazdak Jun 25 '19 at 02:04
  • @Kasrâmvd Of course my memory has enough space for the operations I want to do. The arrays are actually pretty small (just 3*224*224 floats each) and I only have 30-40 of them in RAM at the same time. Btw, I updated my question with more information. Could you please take a look again and see if you have an idea of what might be causing this? I was thinking a lock would help me, so I switched from `mp.RawArray` to `mp.Array`, but I'm still getting spikes in memory usage :/ – Amir Jun 28 '19 at 17:46
  • First of all, don't assume that the `join()` method guarantees your processes have terminated (it's better to check by pid, or filter with `top | grep procname` if you are on Unix). Secondly, the algorithm behind your code, which is absent here, may play a crucial role in this. Based on what you said, if the maximum number of arrays at each pulse is no more than 30-40, we wouldn't see your memory spikes so unexpectedly. Imo, you have to work out the complexity of your algorithm with regard to the objects it uses and see what it is. The bottom line is that there could be many causes and you have to..... – Mazdak Jun 28 '19 at 19:55
  • ....Check them thoroughly. – Mazdak Jun 28 '19 at 19:55
  • Here are some relevant questions that may help. Take a look:
    https://stackoverflow.com/questions/33001155/using-numpy-array-in-shared-memory-slow-when-synchronizing-access
    https://stackoverflow.com/questions/44747145/writing-to-shared-memory-in-python-is-very-slow
    https://stackoverflow.com/questions/10263446/cuda-bad-performance-with-shared-memory-and-no-parallelism
    https://stackoverflow.com/questions/7894791/use-numpy-array-in-shared-memory-for-multiprocessing – Mazdak Jun 28 '19 at 20:05
  • @Kasrâmvd Thank you. I just became suspicious of something: what if calling `join()` does not mean that the process has sent all of its data to the shared memory, and that's why the memory spike happens? Is there a way to make sure all of the data has been moved to memory before `join()` is called? – Amir Jun 28 '19 at 20:09
  • I'm not sure if that's exactly what join does, but it certainly triggers that operation. In Python you can check the status of your process, or even keep track of traceback frames before they happen (may need some trial and error) and/or get more info on them. However, I don't know if/how you can check whether the data has been sent. Please post it if you find anything ;). – Mazdak Jun 28 '19 at 20:16
  • 1
    @Kasrâmvd I think that's the main cause! The process is assumed to be finished (via calling `join()`) but the data is not fully sent to the shared memory and boooom! I'll post if I find anything. – Amir Jun 28 '19 at 20:17
  • @Kasrâmvd I can definitely confirm that manually adding `time.sleep(0.015)` after `join()`ing a process has significantly reduced the memory spikes. But I'm still not certain whether that's because that little bit of time helps the data get fully transferred to the shared memory. I posted another relevant question [here](https://stackoverflow.com/questions/56812806/how-to-make-sure-child-process-finishes-copying-data-into-shared-memory-before-j). – Amir Jun 29 '19 at 10:17
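The question raised in the comments above (whether a child's writes to a shared `mp.Array` are visible in the parent after `join()` returns) can be checked with a small experiment. This is a sketch with made-up names (`writer`, `demo`), not code from the question:

```python
import multiprocessing as mp

def writer(shared):
    # The child's writes land directly in the shared-memory segment;
    # they are not sent through a pipe after the fact.
    for i in range(len(shared)):
        shared[i] = float(i)

def demo():
    buf = mp.Array('f', 8)          # small shared buffer for the test
    p = mp.Process(target=writer, args=(buf,))
    p.start()
    p.join()                        # block until the child has exited
    return buf[:]                   # -> [0.0, 1.0, ..., 7.0]

if __name__ == "__main__":
    print(demo())
```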

0 Answers