Should the HuggingFace transformers TrainingArguments dataloader_num_workers
argument be set per GPU, or in total across all GPUs? And does the answer change depending on whether the training is running in DataParallel (DP) or DistributedDataParallel (DDP) mode?
For example, if I have a machine with 4 GPUs and 48 CPUs (running only this training task), would there be any benefit to setting dataloader_num_workers greater than 12
(48 / 4), or would the workers all end up contending for the same resources?
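For concreteness, here is a minimal sketch of the kind of configuration I mean; the output directory, batch size, and the commented-out Trainer wiring are just placeholders, the only argument I'm asking about is dataloader_num_workers:

```python
# Rough sketch of the setup in question (placeholders except dataloader_num_workers).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./out",              # placeholder
    per_device_train_batch_size=32,  # placeholder
    dataloader_num_workers=12,       # 48 CPUs / 4 GPUs -- per process, or total?
)

# trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
# trainer.train()
```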
As I understand it, when running in DDP mode (launched with torch.distributed.launch
or similar), one training process manages each device, whereas in the default DP mode a single lead process manages everything. So maybe the answer is 12
per process for DDP but ~47
for DP?
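A back-of-the-envelope sketch of what I think the totals would work out to under each mode, assuming my reading of the process model above is correct (that's the assumption I'd like checked), using the 4-GPU / 48-CPU example from above:

```python
# Worker counts I would expect under each mode, if my understanding is right.
num_gpus = 4
num_cpus = 48

# DDP: one training process per GPU, each building its own DataLoader,
# so dataloader_num_workers=12 would mean 4 * 12 = 48 worker processes in total.
ddp_workers_per_process = num_cpus // num_gpus          # 12
ddp_total_workers = num_gpus * ddp_workers_per_process  # 48

# DP: a single lead process with one DataLoader feeding all GPUs,
# so something like dataloader_num_workers=47 would leave one CPU
# for the lead process itself.
dp_workers = num_cpus - 1                               # 47

print(ddp_total_workers, dp_workers)
```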