My code runs for a few iterations, but before training finishes one of the processes is terminated with a SIGKILL for some unknown reason:
backend='nccl'
rank=1
mp.current_process()=<SpawnProcess name='SpawnProcess-2' parent=13950 started>
os.getpid()=13987
setting up rank=1 (with world_size=4)
MASTER_ADDR='127.0.0.1'
44109
backend='nccl'
--> done setting up rank=0
--> done setting up rank=2
--> done setting up rank=1
--> done setting up rank=3
setup process done for rank=0
setup process done for rank=2
setup process done for rank=1
setup process done for rank=3
Starting training...
n_epoch=0
Traceback (most recent call last):
File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 279, in <module>
main_distributed()
File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 188, in main_distributed
spawn_return = mp.spawn(fn=train, args=(opts,), nprocs=opts.world_size)
File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
while not context.join():
File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 105, in join
raise Exception(
Exception: process 1 terminated with signal SIGKILL
I don't understand why it does that. I am not incrementally storing anything as training goes on, so I don't think it should be a memory issue (especially since it trains fine for a few batches).
How do I even start debugging this when the error gives me no information? Any ideas?
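One thing I could do to test my assumption that this is not a memory issue is to log each process's peak resident memory as training progresses. Below is a minimal sketch of that idea using only the standard library; the helper names `peak_rss_mb` and `log_memory` are my own (hypothetical, not part of PyTorch), and it only tracks host RAM, not GPU memory.

```python
import os
import resource
import sys


def peak_rss_mb() -> float:
    """Return the peak resident set size of the current process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_memory(rank: int, step: int) -> str:
    """Print and return one line with this worker's peak host-memory usage.

    Hypothetical helper: it would be called from inside the per-rank
    training function, e.g. once every few batches.
    """
    msg = (
        f"rank={rank} pid={os.getpid()} step={step} "
        f"peak_rss={peak_rss_mb():.1f} MiB"
    )
    print(msg, flush=True)  # flush so the line survives an abrupt kill
    return msg
```

If the reported peak keeps growing from step to step on the rank that dies, memory would be a suspect after all. On Linux the kernel's out-of-memory killer terminates processes with SIGKILL and records it in the kernel log, so checking `dmesg` right after the crash might also show whether that is what happened here.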
In my research I've checked these links but none seem to help:
- How does one fix a `Exception: process 0 terminated with signal SIGSEGV` error and if the single gpu code works fine?
- https://github.com/huggingface/transformers/issues/3660
- https://discuss.pytorch.org/t/exception-process-0-terminated-with-signal-sigkill/75570/5
- https://github.com/PyTorchLightning/pytorch-lightning/issues/1590
- Python script terminated by SIGKILL rather than throwing MemoryError
- https://discuss.pytorch.org/t/torch-utils-data-dataloader-issue/92770
- https://www.reddit.com/r/pytorch/comments/mdsljr/why_does_my_pytorch_distributed_training_ddp_code/
- https://www.quora.com/unanswered/Why-does-my-PyTorch-distributed-training-DDP-code-send-a-SIGKILL-signal-on-its-own
- https://github.com/pytorch/pytorch/issues/54823