
I am trying to train my model on multiple GPUs. I added the imports and setup code for it:

from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed import init_process_group, destroy_process_group

Initialization

import os

def ddp_setup(rank: int, world_size: int):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
    init_process_group(backend="gloo", rank=0, world_size=1)
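For reference, here is a minimal self-contained sketch of the setup pattern I believe is intended, passing the `rank` and `world_size` arguments through to `init_process_group` instead of hard-coding `0` and `1` as my code above does (the structure follows the PyTorch DDP tutorial; nothing here is specific to my model):

```python
import os

from torch.distributed import destroy_process_group, init_process_group


def ddp_setup(rank: int, world_size: int) -> None:
    # Rendezvous address shared by all processes in the group.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    # Pass the real rank/world_size through. Hard-coding rank=0,
    # world_size=1 makes every spawned worker claim to be the only
    # process in a single-process group.
    init_process_group(backend="gloo", rank=rank, world_size=world_size)
```

With `torch.multiprocessing.spawn`, each worker would call `ddp_setup(rank, world_size)` first and `destroy_process_group()` when it is done.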

My model:

model = CMGCNnet(config,
                 que_vocabulary=glovevocabulary,
                 glove=glove,
                 device=device)

model = model.to(0)

if -1 not in args.gpu_ids and len(args.gpu_ids) > 1:
    model = DDP(model, device_ids=[0, 1])
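As a related sanity check: my understanding from the DDP docs is that each process should drive exactly one GPU, so `device_ids` should contain a single entry (that process's own rank), not the full list of GPUs as in my code above. A minimal sketch of what I think the wrapping should look like (`wrap_model` and `use_cuda` are my own names, not from my training script):

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def wrap_model(model: nn.Module, rank: int, use_cuda: bool = True) -> DDP:
    """Wrap a model for DDP. Each process owns exactly ONE device,
    so device_ids holds a single entry (this process's rank)."""
    if use_cuda:
        model = model.to(rank)                # replica lives on this rank's GPU
        return DDP(model, device_ids=[rank])  # one device per process
    return DDP(model)                         # CPU/gloo: no device_ids
```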

It throws the following error:

config_yml : model/config_fvqa_gruc.yml
cpu_workers : 0
save_dirpath : exp_test_gruc
overfit : False
validate : True
gpu_ids : [0, 1]
dataset : fvqa
Loading FVQATrainDataset... True
done splitting
Loading FVQATestDataset...
Loading glove...
Building Model...
Traceback (most recent call last):
  File "trainfvqa_gruc.py", line 512, in <module>
    train()
  File "trainfvqa_gruc.py", line 145, in train
    ddp_setup(0,1)
  File "trainfvqa_gruc.py", line 42, in ddp_setup
    init_process_group(backend="gloo", rank=0, world_size=1)
  File "/home/seecs/miniconda/envs/mucko-edit/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 360, in init_process_group
    timeout=timeout)
RuntimeError: [enforce fail at /opt/conda/conda-bld/pytorch_1544202130060/work/third_party/gloo/gloo/transport/tcp/device.cc:128] rp != nullptr. Unable to find address for: 127.0.0.1localhost. localdomainlocalhost

I tried to get more detail with os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL", but it outputs:

Loading FVQATrainDataset... True
done splitting
Loading FVQATestDataset...
Loading glove...
Building Model...
Segmentation fault

With the NCCL backend it starts training but gets stuck and doesn't go further than this:

Training for epoch 0: 0%| | 0/2039 [00:00<?, ?it/s]

I found a suggested solution, setting GLOO_SOCKET_IFNAME (for example `export GLOO_SOCKET_IFNAME=eth0`), mentioned in https://discuss.pytorch.org/t/runtime-error-using-distributed-with-gloo/16579/3, but where should I add this?
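From what I understand, GLOO_SOCKET_IFNAME is just an environment variable, so it can be set either in the shell before launching, or in Python before `init_process_group` ever runs. A sketch of both options (`eth0` is a placeholder for whatever interface `ip addr` reports on the actual machine):

```python
import os

# Option 1: set it in the shell, before launching the script:
#   export GLOO_SOCKET_IFNAME=eth0
#   python trainfvqa_gruc.py

# Option 2: set it at the top of the script, before init_process_group
# is called (e.g. at the start of ddp_setup):
os.environ["GLOO_SOCKET_IFNAME"] = "eth0"  # placeholder interface name
```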

Can someone help me with this issue? I am hoping to get an answer.
