10

When I launch my main script on the cluster with ddp mode (2 GPUs), PyTorch Lightning duplicates whatever is executed in the main script, e.g. prints or other logic. I need some extended training logic which I would like to handle myself, e.g. do something (once!) after Trainer.fit(). But with the duplication of the main script, this doesn't work as I intend. I also tried to wrap it in if __name__ == "__main__", but that doesn't change the behavior. How could one solve this problem? Or, how can I use some logic around my Trainer object without the duplicates?

dlsf
Can you provide some code? Everything within fit is expected to be done multiple times, since ddp forces all nodes to do an init of the model, but I assume this is not your question? https://pytorch-lightning.readthedocs.io/en/stable/advanced/multi_gpu.html – Fredrik Feb 27 '21 at 13:51
Thanks for your answer. Yes, that's what I would expect too. However, it seems not only what's inside .fit() happens in parallel, but also all the code around it. E.g. when I run a script main.py, where I print some things sequentially and call Trainer.fit(), the prints are duplicated by the number of processes (GPUs). That's clearly not what I would expect. Maybe there's a hack around this, but meanwhile I figured out that native multiprocessing with ddp in PyTorch is light-years better (at least for research); see my own answer. – dlsf Mar 03 '21 at 19:55

5 Answers

7

I have since moved on to use native "ddp" with multiprocessing in PyTorch. As far as I understand, PyTorch Lightning (PTL) is just running your main script multiple times, once per GPU. This is fine if you only want to fit your model in one call of your script. However, a huge drawback in my opinion is the lost flexibility during the training process: the only way of interacting with your experiment is through these (badly documented) callbacks.

Honestly, it is much more flexible and convenient to use native multiprocessing in PyTorch. In the end it was much faster and easier to implement, plus you don't have to search for ages through the PTL documentation to achieve simple things.

I think PTL is going in a good direction by removing much of the boilerplate. However, in my opinion, the Trainer concept needs some serious rework. It is too closed and violates PTL's own principle of "reorganizing PyTorch code, keep native PyTorch code". If you want to use PTL for easy multi-GPU training, I personally would strongly suggest refraining from it; for me it was a waste of time, and you are better off learning native PyTorch multiprocessing. A minimal sketch of the native approach is shown below.
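For reference, this is roughly what the native approach looks like (the model, data and hyperparameters below are just placeholders; adapt them to your setup). Only the function handed to mp.spawn is duplicated across processes; everything around the spawn call runs exactly once:

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def train(rank, world_size):
    # each spawned process owns exactly one GPU
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(10, 1).cuda(rank)   # placeholder model
    model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for step in range(100):                     # placeholder training loop
        x = torch.randn(32, 10, device=rank)
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()                         # gradients are all-reduced across GPUs here
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    # only train() is replicated across processes ...
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)
    # ... so anything placed here runs exactly once, in the parent process
    print("done training")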

dlsf
3

Asked this at the GitHub repo: https://github.com/PyTorchLightning/pytorch-lightning/issues/8563

There are different accelerators for training, and while DDP (DistributedDataParallel) runs the script once per GPU, ddp_spawn and dp don't.

However, certain plugins like DeepSpeedPlugin are built on DDP, so changing the accelerator doesn't stop the main script from running multiple times.
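As a rough sketch, the choice of accelerator is made when constructing the Trainer; the exact argument names depend on your Lightning version:

import pytorch_lightning as pl

# Lightning 1.x style, as discussed in the linked issue
trainer = pl.Trainer(gpus=2, accelerator="ddp_spawn")

# in more recent releases the same choice is expressed as a "strategy", e.g.
# trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp_spawn")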

2

You could quit the duplicated sub-processes by putting the following code after Trainer.fit:

import sys

# every process except the global-rank-0 (main) one exits here
if model.global_rank != 0:
    sys.exit(0)

where model is an instance of your LightningModule, which has a property global_rank specifying the rank of the current process across all nodes; you can roughly think of it as the GPU id or the process id. Everything after this code will only be executed in the main process, i.e., the process with global_rank = 0.

For more information, please refer to the documentation: https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#global_rank
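If you would rather not terminate the extra processes, the same rank check can simply guard the post-fit logic instead (a sketch; run_extended_logic is a hypothetical placeholder for whatever should happen once):

trainer.fit(model)

# only the rank-0 process runs the extra logic; the other ranks skip it
if model.global_rank == 0:
    run_extended_logic()  # hypothetical placeholder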

Laputa
0

Use an environment variable as a flag:

import os

if __name__ == "__main__":
    # the variable is inherited by the child processes Lightning launches,
    # so only the very first invocation of the script sees it unset
    is_primary = os.environ.get("IS_PTL_PRIMARY") is None
    os.environ["IS_PTL_PRIMARY"] = "yes"
    ## code to run on each GPU
    if is_primary:
        ## code to run only once
        pass
dkatsios
0

From the PyTorch Lightning official documentation on DDP, we know that PL intentionally calls the main script multiple times to spin off the child processes that take charge of the GPUs.


It uses the environment variables "LOCAL_RANK" and "NODE_RANK" to identify the spawned processes. So we can add conditions to bypass the code blocks that we don't want executed repeatedly. For example:

import os

if __name__ == "__main__":
    # only the original invocation has neither variable set
    if "LOCAL_RANK" not in os.environ and "NODE_RANK" not in os.environ:
        # code you only want to run once
        pass
donets20