Should the HuggingFace transformers TrainingArguments dataloader_num_workers
argument be set per GPU, or in total across all GPUs? And does the answer change depending on whether the training is running in DataParallel (DP) or DistributedDataParallel (DDP) mode?
For example, if I have a machine with 4 GPUs and 48 CPUs (running only this training task), would there be any benefit to setting dataloader_num_workers greater than 12
(48 / 4), or would the workers all end up contending for the same resources?
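For concreteness, here is a minimal sketch of the kind of configuration I mean; the output directory, batch size, and the commented-out Trainer wiring are just placeholders, the only argument I'm asking about is dataloader_num_workers:

```python
# Rough sketch of the setup in question (placeholders except dataloader_num_workers).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./out",              # placeholder
    per_device_train_batch_size=32,  # placeholder
    dataloader_num_workers=12,       # 48 CPUs / 4 GPUs -- per process, or total?
)

# trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
# trainer.train()
```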
As I understand it, when running in DDP mode (launched with torch.distributed.launch
or similar), one training process manages each device, whereas in the default DP mode a single lead process manages everything. So maybe the answer is 12
per process for DDP but ~47
for DP?
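A back-of-the-envelope sketch of what I think the totals would work out to under each mode, assuming my reading of the process model above is correct (that's the assumption I'd like checked), using the 4-GPU / 48-CPU example from above:

```python
# Worker counts I would expect under each mode, if my understanding is right.
num_gpus = 4
num_cpus = 48

# DDP: one training process per GPU, each building its own DataLoader,
# so dataloader_num_workers=12 would mean 4 * 12 = 48 worker processes in total.
ddp_workers_per_process = num_cpus // num_gpus          # 12
ddp_total_workers = num_gpus * ddp_workers_per_process  # 48

# DP: a single lead process with one DataLoader feeding all GPUs,
# so something like dataloader_num_workers=47 would leave one CPU
# for the lead process itself.
dp_workers = num_cpus - 1                               # 47

print(ddp_total_workers, dp_workers)
```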