Nvidia's NVLink accelerates data transfer between several GPUs on the same machine. I train large models on such a machine using PyTorch.
I can see why NVLink would speed up model-parallel training, since a single forward and backward pass through the model is split across several GPUs.
But would it also accelerate a data-parallel training run that uses DistributedDataParallel?
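
For concreteness, here's a minimal sketch of the kind of single-node DDP setup I mean; the model, data, and hyperparameters below are placeholders, not my actual training code:

```python
# Minimal single-node DDP sketch (placeholder model and data).
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; imagine something much larger here.
    model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    # Placeholder data; each rank gets a different shard via DistributedSampler.
    dataset = TensorDataset(torch.randn(10_000, 1024), torch.randint(0, 10, (10_000,)))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(x), y)
            # backward() triggers the gradient all-reduce across GPUs --
            # this inter-GPU communication is the step I suspect NVLink might affect.
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

I launch it with something like `torchrun --nproc_per_node=4 train.py` (script name is just an example), so each GPU runs its own process and holds a full copy of the model.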