Why does my PyTorch distributed training (DDP) code send a SIGKILL signal on its own?

Question

My code runs for a few iterations, but before training finishes one of the processes is terminated with SIGKILL for no reason I can identify:

backend='nccl'
rank=1
mp.current_process()=<SpawnProcess name='SpawnProcess-2' parent=13950 started>
os.getpid()=13987
setting up rank=1 (with world_size=4)
MASTER_ADDR='127.0.0.1'
44109
backend='nccl'
--> done setting up rank=0
--> done setting up rank=2
--> done setting up rank=1
--> done setting up rank=3
setup process done for rank=0
setup process done for rank=2
setup process done for rank=1
setup process done for rank=3
Starting training...

n_epoch=0
Traceback (most recent call last):
  File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 279, in <module>
    main_distributed()
  File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 188, in main_distributed
    spawn_return = mp.spawn(fn=train, args=(opts,), nprocs=opts.world_size)
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 105, in join
    raise Exception(
Exception: process 1 terminated with signal SIGKILL

I don't understand why this happens. I am not accumulating anything as training progresses, so I don't think it should be a memory issue (especially since it trains fine for a few batches).

How do I even start debugging this when the error gives me no information? Any ideas?
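
One way to test the memory hypothesis is to log each worker's peak memory at every step and watch whether it keeps climbing before the kill. On Linux, a SIGKILL with no Python traceback often comes from the kernel's out-of-memory killer, which records the kill in the kernel log (viewable with dmesg). The sketch below is only a starting point, not a diagnosis: peak_rss_mb and log_memory are hypothetical helper names, it uses only the standard library, it assumes a Unix system, and it measures host RAM, not GPU memory.

```python
import os
import resource
import sys


def peak_rss_mb() -> float:
    """Peak resident set size of this process in MiB.

    ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_memory(rank: int, step: int) -> str:
    """Hypothetical helper: print one memory line for this rank and step."""
    line = (
        f"rank={rank} pid={os.getpid()} step={step} "
        f"peak_rss={peak_rss_mb():.1f}MiB"
    )
    # flush=True so the line is written before the process can be killed
    print(line, flush=True)
    return line
```

Calling log_memory(rank, step) inside the training loop of train would print one line per rank per step. A peak that grows steadily across steps would point to host memory growth in that process, even if nothing is stored on purpose.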

In my research I've checked these links but none seem to help:

@LeoGallucci I am not sure if I did. But perhaps make sure that you don't have a process using a lot of space. At one point I remember opening a lot of files by accident in the dataloader, and that screwed me up. Perhaps see my other related questions/answers on DDP and GPU issues; you might find something there. - Charlie Parker, Jul 29 '21 at 17:16
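
To check the possibility raised in the comment above, that a dataloader is opening many files by accident, a worker can report how many file descriptors it currently holds. This is a minimal sketch assuming Linux (it reads /proc/self/fd). The names count_open_fds and fd_report are hypothetical and not part of PyTorch.

```python
import os
import resource


def count_open_fds() -> int:
    """Count file descriptors open in this process (Linux only).

    The count includes the descriptor that is briefly opened to read
    the /proc/self/fd directory itself.
    """
    return len(os.listdir("/proc/self/fd"))


def fd_report(rank: int) -> str:
    """Hypothetical helper: compare open descriptors to the soft limit."""
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    line = f"rank={rank} open_fds={count_open_fds()} soft_limit={soft_limit}"
    print(line, flush=True)
    return line
```

If open_fds rises with every batch, files are probably being opened without being closed. That alone may not explain a SIGKILL, but it narrows down where resources are leaking.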
