
I want to train models in parallel/distributed fashion on a single machine using multiple CPUs or GPUs. So my question is: when does one ever need to run:

        # I DON'T THINK THIS IS NEEDED, since you don't need to share data... each process loads its own data and uses it
        x, y = x.share_memory_(), y.share_memory_()

As my comment says, my belief is that .share_memory_() is never actually needed in PyTorch, because:

  1. When using GPUs, you have to move the data to the correct GPU anyway, so you would do x.to(rank).
  2. When using CPUs, each process has its own copy of the data (sending data around is expensive in distributed training, as far as I understand), so there is no need to "share" it with .share_memory_() since each process already reads the same data from disk (or something similar to that, I assume).

Are these assumptions correct? Is it true that for data we never need .share_memory_()? Btw the docs say:

Moves the underlying storage to shared memory.

This is a no-op if the underlying storage is already in shared memory and for CUDA tensors. Tensors in shared memory cannot be resized.
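For concreteness, here is a minimal sketch of the GPU case I am describing (hypothetical localhost address/port, the usual torch.multiprocessing.spawn + DDP pattern): each process builds/loads its own batch and just calls .to(rank), with no .share_memory_() anywhere:

    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    def run(rank, world_size):
        # each spawned process joins the process group and uses its own GPU
        os.environ['MASTER_ADDR'] = 'localhost'
        os.environ['MASTER_PORT'] = '12355'
        dist.init_process_group('nccl', rank=rank, world_size=world_size)

        model = nn.Linear(10, 1).to(rank)
        ddp_model = DDP(model, device_ids=[rank])
        opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

        # each process creates/loads its own data and moves it to its GPU;
        # no .share_memory_() involved
        x, y = torch.randn(32, 10), torch.randn(32, 1)
        x, y = x.to(rank), y.to(rank)

        loss = nn.functional.mse_loss(ddp_model(x), y)
        loss.backward()
        opt.step()

        dist.destroy_process_group()

    if __name__ == '__main__':
        world_size = torch.cuda.device_count()
        mp.spawn(run, args=(world_size,), nprocs=world_size)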

Note, however, that if one is using CPUs only, I think we do need the following for the model:

    else:
        # if we want multiple CPUs, just make sure the model is shared properly across the CPUs with share_memory()
        # note that this op is a no-op if the model is already in shared memory
        model = model.share_memory()
        ddp_model = DDP(model)  # I think removing device_ids should be fine...?
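Put together, a runnable version of that CPU-only branch might look like the sketch below (gloo backend, hypothetical port, mp.spawn, and no device_ids when wrapping with DDP); whether model.share_memory() is actually required here is part of what I'm asking:

    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    def run(rank, world_size):
        # CPU-only: use the gloo backend instead of nccl
        os.environ['MASTER_ADDR'] = 'localhost'
        os.environ['MASTER_PORT'] = '12356'
        dist.init_process_group('gloo', rank=rank, world_size=world_size)

        model = nn.Linear(10, 1)
        model = model.share_memory()  # the call in question (no-op if already shared)
        ddp_model = DDP(model)        # no device_ids on CPU

        opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
        x, y = torch.randn(32, 10), torch.randn(32, 1)  # each process has its own copy
        loss = nn.functional.mse_loss(ddp_model(x), y)
        loss.backward()
        opt.step()

        dist.destroy_process_group()

    if __name__ == '__main__':
        mp.spawn(run, args=(2,), nprocs=2)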
