
I have a VM with 2 V100s and I am training GPT-2-like models (same architecture, fewer layers) using the really nice Trainer API from Hugging Face, with the PyTorch backend.

I am observing that when I train the exact same model (6 layers, ~82M parameters) with exactly the same data and TrainingArguments, training on a single GPU is significantly faster than on 2 GPUs: ~5 hrs vs ~6.5 hrs.

How would one debug this kind of issue to understand what's causing the slowdown?

Extra notes:

  • both GPUs are being used (checked by watching the nvidia-smi output)
  • I am using fp16 precision
  • My TrainingArguments values are the following (a stripped-down script sketch comes right after them):
{
    "optim": "adamw_torch",
    "evaluation_strategy": "epoch",
    "save_strategy": "epoch",
    "fp16": true,
    "gradient_checkpointing": true,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,
    "dataloader_num_workers": 4,
    "dataloader_pin_memory": true,
    "gradient_accumulation_steps": 1,
    "num_train_epochs": 5
}
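
For completeness, here is a stripped-down sketch of how these arguments are wired up. The dataset below is a synthetic placeholder (random token IDs) standing in for my real data pipeline, and "out" is just a placeholder output path:

# Minimal sketch of the training setup. The dataset is a synthetic stand-in
# for my real data; only the Trainer / TrainingArguments wiring is the point.
import torch
from torch.utils.data import Dataset
from transformers import GPT2Config, GPT2LMHeadModel, Trainer, TrainingArguments


class RandomTokenDataset(Dataset):
    """Placeholder dataset of random token IDs; labels equal the inputs
    (GPT2LMHeadModel shifts them internally to compute the LM loss)."""

    def __init__(self, n_samples=512, seq_len=512, vocab_size=50257):
        self.data = torch.randint(0, vocab_size, (n_samples, seq_len))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        ids = self.data[idx]
        return {"input_ids": ids, "labels": ids.clone()}


model = GPT2LMHeadModel(GPT2Config(n_layer=6))  # 6-layer GPT-2-like model, ~82M params

training_args = TrainingArguments(
    output_dir="out",
    optim="adamw_torch",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    fp16=True,
    gradient_checkpointing=True,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    dataloader_num_workers=4,
    dataloader_pin_memory=True,
    gradient_accumulation_steps=1,
    num_train_epochs=5,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=RandomTokenDataset(),
    eval_dataset=RandomTokenDataset(n_samples=64),
)
trainer.train()

As far as I understand, launching this as plain python train.py with both GPUs visible makes the Trainer wrap the model in torch.nn.DataParallel, whereas launching it with torchrun --nproc_per_node=2 train.py runs it under DistributedDataParallel.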

The output of nvidia-smi topo -m is:

$ nvidia-smi topo -m
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      SYS     0-11            N/A
GPU1    SYS      X      0-11            N/A

I understand that without NVLink inter-GPU communication is not as fast as it could be, but can that alone account for a slowdown of this size? And if so, is there anything I can do, or will I always have slower training times on 2 GPUs (which would make multi-GPU training essentially useless)?
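
For reference, this is roughly the kind of per-step breakdown I could collect with torch.profiler if someone can tell me what to compare between the 1-GPU and 2-GPU runs. It is only a sketch reusing the placeholder model and dataset from the script above, with max_steps kept small so the profile stays manageable:

# Rough profiling sketch: run a few steps under torch.profiler to see where
# the time goes (CUDA kernels vs. communication vs. CPU/dataloader work).
# Assumes `model` and RandomTokenDataset from the sketch above.
from torch.profiler import ProfilerActivity, profile
from transformers import Trainer, TrainingArguments

profile_args = TrainingArguments(
    output_dir="profile_run",
    fp16=True,
    gradient_checkpointing=True,
    per_device_train_batch_size=16,
    max_steps=20,        # a short run is enough to compare per-step timings
    report_to=[],        # disable logging integrations for the profiling run
)
profile_trainer = Trainer(model=model, args=profile_args,
                          train_dataset=RandomTokenDataset())

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    profile_trainer.train()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))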


1 Answer


Keeping this here for reference. The cause was "gradient_checkpointing": true. The slowdown induced by gradient checkpointing appears to be larger on 2 GPUs than on a single GPU. I don't really know the root cause of this behaviour; if anyone does, I would really appreciate an explanation.
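
A minimal sketch of how the effect can be checked (assuming the RandomTokenDataset placeholder from the question's sketch): run a fixed number of steps with checkpointing on and off, once with CUDA_VISIBLE_DEVICES=0 and once with both GPUs visible, and compare the reported train_runtime.

# Rough benchmark sketch: fixed number of steps with gradient checkpointing
# on vs. off; run it once per GPU configuration and compare the gap.
# RandomTokenDataset is the placeholder dataset from the question's sketch.
from transformers import GPT2Config, GPT2LMHeadModel, Trainer, TrainingArguments


def timed_run(gradient_checkpointing: bool) -> float:
    model = GPT2LMHeadModel(GPT2Config(n_layer=6))  # fresh model per run
    args = TrainingArguments(
        output_dir="bench",
        fp16=True,
        gradient_checkpointing=gradient_checkpointing,
        per_device_train_batch_size=16,
        max_steps=100,   # fixed step count so the runtimes are comparable
        report_to=[],
    )
    trainer = Trainer(model=model, args=args,
                      train_dataset=RandomTokenDataset())
    return trainer.train().metrics["train_runtime"]


print("checkpointing on :", timed_run(True))
print("checkpointing off:", timed_run(False))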

ClonedOne
  • Do you have an example of a full notebook of how to run DDP with HF's Trainer? In particular I want to know: do you wrap the model in DDP? Change the args to Trainer or TrainingArguments in any way? Wrap the optimizer in any distributed trainer (like cherry? cherry is a PyTorch lib for things like this)? Also, what about the init process group that is usually needed? Do you know / mind sharing code? – Charlie Parker Aug 17 '22 at 15:18
  • Not sure if still relevant, but adding since I just found this: From [here](https://discuss.huggingface.co/t/training-using-multiple-gpus/1279/4): "2 GPUs don’t bring a lot of speedup compared to one since you add all those synchronization operations. The main speedup is that you should have double the batch size automatically so less iterations (unless you used max_steps in your command)." – Penguin Mar 22 '23 at 15:01