I have read these threads [1] [2] [3] [4], and this article.
I think I understand how batch size and epochs work with DDP, but I am not sure about the learning rate.
Let's say I have a dataset of 100 * 8 images. In a non-distributed scenario, I set the batch size to 8, so each epoch will do 100 gradient steps.
Now I am in a multi-node, multi-GPU scenario, with 2 nodes and 4 GPUs per node (so the world size is 8).
I understand that I need to pass batches of 8 / 8 = 1 to each worker, because each update will aggregate the gradients from the 8 GPUs. In each worker, the data loader will still load 100 batches, but each batch contains only 1 sample. So the whole dataset is processed exactly once per epoch.
I checked, and this seems to be what actually happens.
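To make the bookkeeping concrete, here is a minimal pure-Python sketch of the arithmetic I have in mind. It does not use PyTorch at all; `shard_indices` is a hypothetical helper, and the strided split is my assumption about how the samples are divided among ranks (I believe it mirrors a non-shuffled `DistributedSampler`, but the exact split does not matter for the counts).

```python
def shard_indices(dataset_size, world_size, rank):
    """Hypothetical helper: rank r takes samples r, r + world_size, ..."""
    return list(range(rank, dataset_size, world_size))


dataset_size = 100 * 8   # 800 images
world_size = 2 * 4       # 2 nodes x 4 GPUs per node
global_batch = 8         # batch size of the non-distributed run

# Each worker gets an equal share of the global batch.
per_gpu_batch = global_batch // world_size

# One disjoint shard of the dataset per rank.
shards = [shard_indices(dataset_size, world_size, r) for r in range(world_size)]

# Number of optimizer steps each worker performs per epoch.
steps_per_epoch = len(shards[0]) // per_gpu_batch

# Every sample index should appear exactly once across all shards.
seen = sorted(i for shard in shards for i in shard)

print(per_gpu_batch, steps_per_epoch, seen == list(range(dataset_size)))
```

With these numbers each worker sees 100 batches of 1 sample, and together the 8 workers cover all 800 images once per epoch.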
But what about the learning rate? According to the official docs:
When a model is trained on M nodes with batch=N, the gradient will be M times smaller when compared to the same model trained on a single node with batch=M*N if the loss is summed (NOT averaged as usual) across instances in a batch (because the gradients between different nodes are averaged). [...] But in most cases, you can just treat a DistributedDataParallel wrapped model, a DataParallel wrapped model and an ordinary model on a single GPU as the same (E.g. using the same learning rate for equivalent batch size).
I understand that the gradients are averaged, so if the loss is averaged over the samples in a batch nothing changes, whereas if it is summed we need to account for the extra factor. But does 'nodes' refer to the total number of GPUs across all cluster nodes (the world size) or only to the number of cluster nodes? In my example, would M be 2 or 8? Some posts in the threads I linked say that the gradient is divided 'by the number of GPUs'. How exactly is the gradient aggregated?
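To show why the answer matters, here is a toy pure-Python model of the arithmetic, not of DDP itself. It assumes a one-parameter least-squares loss and a hypothetical `averaged_worker_grad` helper in which every worker computes a local gradient and the results are simply averaged over `num_workers`. Whether `num_workers` should be 2 or 8 in my setup is exactly what I am asking, so the sketch tries both.

```python
def per_sample_grads(w, xs, ys):
    """Gradient of 0.5 * (w * x - y) ** 2 with respect to w, per sample."""
    return [(w * x - y) * x for x, y in zip(xs, ys)]


def reduce_grads(grads, reduction):
    """Combine per-sample gradients the way a 'mean' or 'sum' loss would."""
    total = sum(grads)
    return total / len(grads) if reduction == "mean" else total


def averaged_worker_grad(grads, num_workers, reduction):
    """Assumed aggregation: each worker reduces its shard, then the
    local gradients are averaged over num_workers."""
    per_worker = len(grads) // num_workers
    local = [
        reduce_grads(grads[r * per_worker:(r + 1) * per_worker], reduction)
        for r in range(num_workers)
    ]
    return sum(local) / num_workers


xs = [float(i) for i in range(1, 9)]   # one global batch of 8 samples
ys = [2.0 * x for x in xs]
grads = per_sample_grads(0.0, xs, ys)

for reduction in ("mean", "sum"):
    single = reduce_grads(grads, reduction)
    for num_workers in (2, 8):
        dist = averaged_worker_grad(grads, num_workers, reduction)
        print(reduction, num_workers, single / dist)
```

In this toy model the ratio is 1 for a mean-reduced loss regardless of `num_workers`, while for a sum-reduced loss the single-process gradient is `num_workers` times larger than the averaged one. So with a summed loss, the learning-rate correction would differ by a factor of 4 depending on whether M is 2 or 8.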