I have a VM with 2 V100s and I am training GPT-2-like models (same architecture, fewer layers) using the really nice Trainer API from Hugging Face, with the PyTorch backend.
I am observing that when I train the exact same model (6 layers, ~82M parameters) with exactly the same data and TrainingArguments, training on a single GPU is significantly faster than on 2 GPUs: ~5 hrs vs ~6.5 hrs.
How would one debug this kind of issue to understand what's causing the slowdown?
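(For instance, I could add a rough step-timing callback like the sketch below and compare the average step time on 1 vs 2 GPUs, but I'm not sure what else to look at. This is just what I had in mind, not something I've fully tried yet.)

```python
# Rough sketch: log average wall-clock time per optimization step using the
# transformers TrainerCallback API, then run the same script on 1 and 2 GPUs
# and compare the numbers.
import time
from transformers import TrainerCallback

class StepTimerCallback(TrainerCallback):
    """Accumulates wall-clock time per optimization step and prints a running average."""
    def __init__(self):
        self.step_start = None
        self.total = 0.0
        self.steps = 0

    def on_step_begin(self, args, state, control, **kwargs):
        self.step_start = time.perf_counter()

    def on_step_end(self, args, state, control, **kwargs):
        self.total += time.perf_counter() - self.step_start
        self.steps += 1
        if self.steps and state.global_step % 100 == 0:
            print(f"avg step time: {self.total / self.steps:.3f}s over {self.steps} steps")

# usage: trainer = Trainer(..., callbacks=[StepTimerCallback()])
```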
Extra notes:
- both GPUs are being used (verified by watching nvidia-smi output)
- I am using fp16 precision
- My TrainingArguments values are:
{
"optim": "adamw_torch",
"evaluation_strategy": "epoch",
"save_strategy": "epoch",
"fp16": true,
"gradient_checkpointing": true,
"per_device_train_batch_size": 16,
"per_device_eval_batch_size": 16,
"dataloader_num_workers": 4,
"dataloader_pin_memory": true,
"gradient_accumulation_steps": 1,
"num_train_epochs": 5
}
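For completeness, the dict above is passed straight into TrainingArguments, roughly like this (output_dir, the model, and the datasets are placeholders):

```python
# Placeholder sketch of how the arguments above are consumed; `config` is the
# dict shown above, and model / train_dataset / eval_dataset come from my own
# setup code.
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./checkpoints",  # placeholder path
    **config,                    # the TrainingArguments values listed above
)

trainer = Trainer(
    model=model,                 # my 6-layer GPT-2-like model
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
```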
The output of nvidia-smi topo -m is:
$ nvidia-smi topo -m
GPU0 GPU1 CPU Affinity NUMA Affinity
GPU0 X SYS 0-11 N/A
GPU1 SYS X 0-11 N/A
I understand that without NVLink inter-GPU communication is not as fast as it could be, but can that alone explain a slowdown of this size? And if so, is there anything I can do about it, or will training on 2 GPUs always be slower than on one (making multi-GPU training essentially useless here)?
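If it helps, one thing I thought of to quantify the communication cost is a standalone all-reduce benchmark between the two GPUs, independent of the Trainer. This is only a sketch of what I would run (the 82M-element tensor is meant to roughly match one fp16 gradient all-reduce for my model):

```python
# allreduce_bench.py - rough benchmark of raw all-reduce time across the 2 GPUs.
# Launch with: torchrun --nproc_per_node=2 allreduce_bench.py
import os
import time
import torch
import torch.distributed as dist

def main():
    dist.init_process_group("nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # ~82M fp16 elements, roughly the size of one gradient all-reduce for my model
    tensor = torch.randn(82_000_000, dtype=torch.float16, device="cuda")

    # warm-up iterations so NCCL setup cost is excluded from the timing
    for _ in range(5):
        dist.all_reduce(tensor)
    torch.cuda.synchronize()

    iters = 20
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(tensor)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters

    if dist.get_rank() == 0:
        gb = tensor.numel() * tensor.element_size() / 1e9
        print(f"all_reduce of {gb:.2f} GB took {elapsed * 1000:.1f} ms per call")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```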