
Currently I'm working on my final year project, which involves developing a multi-stream CNN to perform action recognition. The final output depends on the outputs produced by the independent streams (spatial & temporal). My objective is to make inference as efficient as possible, so I want the two streams to run simultaneously. By default, the forward passes run sequentially, so the execution time is long.

rgb = network1(input1)
of = network2(input2)
final_output = (rgb + of)/2
return final_output

I have read about PyTorch multiprocessing and tried an example with torch.multiprocessing.Process, but the execution time was longer than I expected. The code is shown below.

import torch
import torchvision
import torch.multiprocessing as mp
import time

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

net1 = torchvision.models.quantization.mobilenet_v3_large(pretrained=True, quantize=False).eval()
net2 = torchvision.models.quantization.mobilenet_v3_large(pretrained=True, quantize=False).eval()

if __name__ == "__main__":
    inputs = torch.rand(1, 3, 224, 224)
    start = time.time()
    outputs = net1(inputs)
    end = time.time()
    print('Time taken for forward prop on 1 stream: (sequentially)',end-start)
    
    start = time.time()
    outputs = net1(inputs)
    outputs = net2(inputs)
    end = time.time()
    print('Time taken for forward prop on 2 stream: (sequentially)',end-start)
    
    p1 = mp.Process(target=net1.forward, args=(inputs,))
    p2 = mp.Process(target=net2.forward, args=(inputs,))
    start = time.time()
    p1.start()
    p2.start()
    p1.join()
    p2.join()
    end = time.time()
    print('Time taken for forward prop on 2 stream: (parallel)',end-start)

and this is the output:

Time taken for forward prop on 1 stream: (sequentially) 0.08776640892028809
Time taken for forward prop on 2 stream: (sequentially) 0.15159368515014648
Time taken for forward prop on 2 stream: (parallel) 3.8684606552124023

As the timings show, the "parallel" version is actually much slower than the sequential one. Any idea how I could make the forward passes of both networks run simultaneously?

talonmies
Darkerz

1 Answer


My objective is to make the inference process as efficient as possible, so I wish to make the 2 different stream run simultaneously.

Your code started with

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

so I'm assuming that your objective was to use CUDA, and that never moving your data or models to `device` was an oversight.
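For reference, moving both the model's parameters and the input tensor to the same device looks like this. This is a minimal sketch using a tiny stand-in `nn.Linear` model (so it runs self-contained, without downloading pretrained weights); substitute your mobilenet networks.

```python
import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Stand-in for mobilenet_v3_large: any nn.Module works the same way.
net1 = nn.Linear(8, 4).to(device)      # moves the model's parameters to `device`
inputs = torch.rand(1, 8).to(device)   # moves the input tensor to the same device

with torch.no_grad():                  # inference only: skip autograd bookkeeping
    out = net1(inputs)

print(out.shape)  # torch.Size([1, 4])
```

If the model and input live on different devices, PyTorch raises a runtime error rather than silently copying, so both `.to(device)` calls are required.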

torch.multiprocessing creates multiple processes, and multiple processes have different default CUDA streams.

However, running multiple OS threads or processes won't cause a speed-up, since CUDA only actually runs work from one of them at a time on any given device. (1)

Further, even if you used multiple CUDA streams in the same process, this tends not to accelerate computation, in my experience. This is a general CUDA limitation (1), unrelated to PyTorch. Streams are useful for overlapping I/O and computation, though: for example, one CUDA stream can copy data to the GPU while another runs inference on a previous batch.
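A rough sketch of that copy/compute overlap, using two `torch.cuda.Stream` objects. The stream names (`copy_stream`, `compute_stream`) are illustrative, the model is a stand-in `nn.Linear`, and on a CPU-only machine the sketch falls back to plain sequential inference:

```python
import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = nn.Linear(256, 10).to(device).eval()   # stand-in for a real network
batches = [torch.rand(32, 256) for _ in range(4)]

results = []
with torch.no_grad():
    if device.type == "cuda":
        copy_stream = torch.cuda.Stream()
        compute_stream = torch.cuda.Stream()
        gpu_batches = [None] * len(batches)
        for i, b in enumerate(batches):
            with torch.cuda.stream(copy_stream):
                # Async host-to-device copy; pinned memory lets it run
                # concurrently with compute on the other stream.
                gpu_batches[i] = b.pin_memory().to(device, non_blocking=True)
            with torch.cuda.stream(compute_stream):
                # Compute on batch i must wait for its own copy, but the
                # copy of batch i+1 can overlap with this compute.
                compute_stream.wait_stream(copy_stream)
                results.append(model(gpu_batches[i]))
        torch.cuda.synchronize()   # wait for all queued GPU work
    else:
        # CPU fallback: no streams, just sequential inference.
        results = [model(b) for b in batches]

print(len(results), results[0].shape)
```

This overlaps data transfer with compute across consecutive batches; it does not make two independent networks compute simultaneously, which is the limitation described above.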


(1) This appears to have changed with the Ampere architecture, which supports "multi-instance GPU" (MIG).

MWB
  • I don't understand this answer at all. What do CUDA streams have to do with the question, which is using the native python multiprocessing module under the hood? There is no use of CUDA streams in the question and your answer has nothing to do with what the code is doing or the results it is producing – talonmies Oct 09 '21 at 05:54
  • Two processes create two *contexts*. The resulting behaviour has nothing to do with CUDA stream semantics. Pytorch explicitly exposes CUDA streams via APIs. Those are not being used here. – talonmies Oct 09 '21 at 10:12
  • Which version of the answer? The original one, which insisted this was all related to streams, or the heavily revised current one, which insists that the GPU can only run operations from one thread or process at a time (which is also incorrect)? Multiprocessing and multithreading are different cases -- threads share a context and can use streams (surprise!) to run asynchronous operations in parallel. Multiprocessing means a context per process, and while it is true that this was a hard context switch 10 years ago, on a number of modern platforms the driver can multitask from several contexts at once – talonmies Oct 12 '21 at 05:34
  • @talonmies *"Which version of the answer?"* The quoted sentence was in v1 (just without "however") *"this was all related to streams"* -- it is. Which streams do you think OP is referring to (multiple times), while the code also mentions `"cuda:0"`? *"on a number of modern platforms the driver can multitask from several contexts at once"* -- which ones? I mentioned that Ampere (at least) must be the exception. – MWB Oct 12 '21 at 06:14