I am trying to modify the ResNet-18 model in PyTorch so that, at the end of each ResNet block, a separately trained auxiliary model takes that block's intermediate output as input and makes some auxiliary predictions during inference.
I want the auxiliary computation for a block to run in parallel with the computation of the next ResNet block, so as to reduce the end-to-end latency of the whole pipeline on the GPU.
I have base code that is functionally correct, but the auxiliary model executes serially after the ResNet block computation. I verified this in two ways:
1. By adding print statements and checking the order of execution.
2. By measuring the running time of the original ResNet model (say time t1) and of the auxiliary model (say time t2), roughly as sketched below. My end-to-end execution time is currently t1 + t2.
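For reference, this is roughly how I take the two measurements (a simplified sketch with toy stand-in models; the real measurement uses ResNet-18 and my trained auxiliary model):

import time
import torch
import torch.nn as nn

def time_gpu(fn, inp, n_warmup=5, n_iters=50):
    # Warm up so one-time CUDA initialization does not skew the numbers
    for _ in range(n_warmup):
        fn(inp)
    torch.cuda.synchronize()  # wait for warm-up kernels to finish
    start = time.perf_counter()
    for _ in range(n_iters):
        fn(inp)
    torch.cuda.synchronize()  # wait for the timed kernels to finish
    return (time.perf_counter() - start) / n_iters

# Toy stand-ins for the real models
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU()).cuda().eval()
aux_model = nn.Sequential(nn.Flatten(), nn.Linear(8 * 64 * 64, 10)).cuda().eval()
x = torch.randn(1, 3, 64, 64, device="cuda")

with torch.no_grad():
    t1 = time_gpu(model, x)                      # "ResNet" path
    t2 = time_gpu(aux_model, model(x).detach())  # auxiliary path
print(f"t1 = {t1 * 1e3:.2f} ms, t2 = {t2 * 1e3:.2f} ms")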
Here is the original ResNet block code (this is the BasicBlock, since I am experimenting with ResNet-18). The entire code is available here:
class BasicBlock(nn.Module):
    ...
    def forward(self, x):
        residual = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)
        if self.downsample is not None:
            residual = self.downsample(x)
        out += residual
        out = self.relu(out)
        return out
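For context, my modified forward below assumes that each block already has self.auxiliary_model and self.auxiliary_prediction_size attached. The real wiring is in the linked code; a simplified sketch of the kind of setup I mean (using torchvision's resnet18 as a stand-in for my modified model, and a placeholder auxiliary head) is:

import itertools
import torch.nn as nn
from torchvision.models import resnet18

# Sketch only: attach a (placeholder) auxiliary head and the expected prediction
# size to every BasicBlock; the early-exit check in forward() uses the latter.
model = resnet18()
num_classes = 1000
for block in itertools.chain(model.layer1, model.layer2, model.layer3, model.layer4):
    block.auxiliary_model = nn.LazyLinear(num_classes)  # stands in for the trained auxiliary model
    block.auxiliary_prediction_size = num_classes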
This is my modification, which is functionally correct but executes serially:
def forward(self, x):
    if len(x[0]) == self.auxiliary_prediction_size:  # Got an auxiliary prediction earlier
        return x
    # Do the usual block computation
    residual = x
    out = self.conv1(x)
    out = self.bn1(out)
    out = self.relu(out)
    out = self.conv2(out)
    out = self.bn2(out)
    if self.downsample is not None:
        residual = self.downsample(x)
    out += residual
    out = self.relu(out)
    # Try to make an auxiliary prediction
    # First flatten the tensor (also assume for now that the batch size is 1)
    batchSize = x.shape[0]
    intermediate_output = out.view(batchSize, -1)
    # Place the flattened tensor on the GPU
    device = torch.device("cuda:0")
    auxiliary_input = intermediate_output.to(device).float()
    # Make the auxiliary prediction
    auxiliary_prediction = self.auxiliary_model(auxiliary_input)
    if auxiliary_prediction meets some condition:
        return auxiliary_prediction
    # If no auxiliary prediction is made, return the intermediate output
    return out
Understandably, the code above creates a data dependency between the execution of the auxiliary model and the next block, which is why everything happens serially. The first thing I tried was to check whether breaking this data dependency reduces latency: I let the auxiliary model execute but did not return auxiliary_prediction even when the condition was met. (This breaks functionality, but the experiment was purely to understand the behavior.) Essentially, what I did was:
batchSize = x.shape[0]
intermediate_output = out.view(batchSize, -1)
# Place the flattened tensor on the GPU
device = torch.device("cuda:0")
auxiliary_input = intermediate_output.to(device).float()
# Make the auxiliary prediction
auxiliary_prediction = self.auxiliary_model(auxiliary_input)
if auxiliary_prediction meets some condition:
    # Comment out the return to break the data dependency
    # return auxiliary_prediction
    pass
# If no auxiliary prediction is made, return the intermediate output
return out
However, this did not help. Researching further, I stumbled upon CUDA streams via this Stack Overflow link, and tried incorporating them as follows:
def forward(self, x):
    if len(x[0]) == self.auxiliary_prediction_size:  # Got an auxiliary prediction earlier
        return x
    s1 = torch.cuda.Stream()
    s2 = torch.cuda.Stream()
    with torch.cuda.Stream(s1):
        # Do the usual block computation
        residual = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)
        if self.downsample is not None:
            residual = self.downsample(x)
        out += residual
        out = self.relu(out)
    with torch.cuda.Stream(s2):
        # Try to make an auxiliary prediction
        # First flatten the tensor (also assume for now that the batch size is 1)
        out_detach = out.detach()  # Detach from the backward pass / computational-graph dependency
        batchSize = x.shape[0]
        intermediate_output = out_detach.view(batchSize, -1)
        # Place the flattened tensor on the GPU
        device = torch.device("cuda:0")
        auxiliary_input = intermediate_output.to(device).float()
        # Make the auxiliary prediction
        auxiliary_prediction = self.auxiliary_model(auxiliary_input)
        if auxiliary_prediction meets some condition:
            return auxiliary_prediction
    # If no auxiliary prediction is made, return the intermediate output
    return out
However, the NVIDIA Visual Profiler output still indicates that all work is issued on the default stream and still serialized. Note that I did verify, with a small standalone CUDA program, that CUDA streams are supported by the CUDA version I am using.
My questions:
1. Why does breaking the data dependency not cause PyTorch to schedule the computations in parallel? I thought that was the point of dynamic computation graphs in PyTorch.
2. Why does using CUDA streams not delegate the computation to non-default streams?
3. Are there alternative approaches for executing the auxiliary model asynchronously / in parallel with the ResNet block computation? For example, would something like the thread-based sketch below be a reasonable direction?
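To make question 3 concrete: one direction I have considered but not yet validated is dispatching the auxiliary model from a separate Python thread. Below is a minimal sketch of what I mean, with toy stand-in models; I do not know whether the GIL and the way PyTorch issues kernels would actually let the two paths overlap on the GPU like this:

import threading
import torch
import torch.nn as nn

# Toy stand-ins for one ResNet block and the auxiliary model (illustrative only)
block = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU()).cuda().eval()
aux_model = nn.Sequential(nn.Flatten(), nn.Linear(64 * 56 * 56, 10)).cuda().eval()

x = torch.randn(1, 64, 56, 56, device="cuda")
aux_result = {}

def run_aux(t):
    # Run the auxiliary model on the detached intermediate output
    with torch.no_grad():
        aux_result["pred"] = aux_model(t)

with torch.no_grad():
    out = block(x)
    worker = threading.Thread(target=run_aux, args=(out.detach(),))
    worker.start()    # hope: auxiliary kernels get issued while...
    out = block(out)  # ...the next block's kernels are running
    worker.join()
print(aux_result["pred"].shape)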