44

I just got this message when running a feed-forward pass through a torch.nn.Conv2d layer; it fails with the following stack trace:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-26-04bd4a00565d> in <module>
      3 
      4 # call training function
----> 5 losses = train(D, G, n_epochs=n_epochs)

<ipython-input-24-b539315e0aa0> in train(D, G, n_epochs, print_every)
     46                 real_images = real_images.cuda()
     47 
---> 48             D_real = D(real_images)
     49             d_real_loss = real_loss(D_real, True) # smoothing label 1 => 0.9
     50 

~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    548             result = self._slow_forward(*input, **kwargs)
    549         else:
--> 550             result = self.forward(*input, **kwargs)
    551         for hook in self._forward_hooks.values():
    552             hook_result = hook(self, input, result)

<ipython-input-14-bf68e57c25ff> in forward(self, x)
     48         """
     49 
---> 50         x = self.leaky_relu(self.conv1(x))
     51         x = self.leaky_relu(self.conv2(x))
     52         x = self.leaky_relu(self.conv3(x))

~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    548             result = self._slow_forward(*input, **kwargs)
    549         else:
--> 550             result = self.forward(*input, **kwargs)
    551         for hook in self._forward_hooks.values():
    552             hook_result = hook(self, input, result)

~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py in forward(self, input)
     98     def forward(self, input):
     99         for module in self:
--> 100             input = module(input)
    101         return input
    102 

~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    548             result = self._slow_forward(*input, **kwargs)
    549         else:
--> 550             result = self.forward(*input, **kwargs)
    551         for hook in self._forward_hooks.values():
    552             hook_result = hook(self, input, result)

~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/conv.py in forward(self, input)
    347 
    348     def forward(self, input):
--> 349         return self._conv_forward(input, self.weight)
    350 
    351 class Conv3d(_ConvNd):

~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/conv.py in _conv_forward(self, input, weight)
    344                             _pair(0), self.dilation, self.groups)
    345         return F.conv2d(input, weight, self.bias, self.stride,
--> 346                         self.padding, self.dilation, self.groups)
    347 
    348     def forward(self, input):

RuntimeError: Unable to find a valid cuDNN algorithm to run convolution

Running nvidia-smi shows:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 770     On   | 00000000:01:00.0 N/A |                  N/A |
| 38%   50C    P8    N/A /  N/A |    624MiB /  4034MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0                    Not Supported                                       |
+-----------------------------------------------------------------------------+

I'm using Python 3.7, PyTorch 1.5, and the GPU is an Nvidia GeForce GTX 770, running on Ubuntu 18.04.2. I haven't found that error message anywhere. Does it ring any bell?

Thanks a lot in advance.

Francisco Ramos

9 Answers

101

According to this answer for a similar issue with TensorFlow, it can occur when the VRAM limit is hit (which is rather non-intuitive given the error message).

In my case, with PyTorch model training, decreasing the batch size helped. You could try this, or decrease your model size so it consumes less VRAM.
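
For reference, a minimal sketch of the kind of change that helps, assuming a standard DataLoader setup (the dataset here is a dummy stand-in):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in data just to illustrate the pattern; replace with the real dataset.
images = torch.randn(256, 3, 32, 32)
labels = torch.zeros(256, dtype=torch.long)
train_dataset = TensorDataset(images, labels)

# Reducing batch_size (e.g. from 64 to 16) lowers the peak VRAM used per step.
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

for real_images, _ in train_loader:
    if torch.cuda.is_available():
        real_images = real_images.cuda()
        # Rough check of how much memory the current allocations use.
        print(torch.cuda.memory_allocated() / 1024**2, "MiB allocated")
    break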

Mikhail Kotyushev
  • Thanks for your answer, just a small heads-up: this can happen for multiple reasons, but the most common one is the one you mentioned! – MEH Mar 15 '22 at 13:00
  • I saw this error when training YOLOv5; decreasing the batch size solved the problem. – Sajjad Aemmi Aug 02 '22 at 07:06
6

This error can be quite tricky: under certain circumstances, running out of memory is also reported with this error message.
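
One way to check whether memory pressure is the real culprit is to query PyTorch's allocator right before the failing call; a sketch:

import torch

if torch.cuda.is_available():
    device = torch.device("cuda:0")
    # Memory actually handed out to tensors vs. what PyTorch's caching
    # allocator has reserved from the driver (compare with nvidia-smi).
    allocated = torch.cuda.memory_allocated(device) / 1024**2
    reserved = torch.cuda.memory_reserved(device) / 1024**2
    print(f"allocated: {allocated:.0f} MiB, reserved: {reserved:.0f} MiB")

    # Freeing cached blocks can help when several processes share one GPU.
    torch.cuda.empty_cache()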

Marcus
  • Exactly. I happened to encounter this problem when multiple processes running on one GPU competed for the limited memory simultaneously. – Ember Xu Apr 22 '23 at 13:58
0

I got this error while speed-testing inference on different EC2 instance types. When I dug through the logs, I found this:

(pid=20839) /home/ubuntu/src/skai-ml/venv/lib/python3.7/site-packages/torch/cuda/__init__.py:87: UserWarning: 
(pid=20839)     Found GPU0 GRID K520 which is of cuda capability 3.0.
(pid=20839)     PyTorch no longer supports this GPU because it is too old.
(pid=20839)     The minimum cuda capability that we support is 3.5.

Lesson learned: don't use g2.XX instance types for PyTorch models. g3.XX and p series worked fine.
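
A quick way to check whether a given node's GPU is still supported is to query its compute capability; a sketch (the 3.5 threshold is the one quoted in the warning above):

import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    print(f"{name}: compute capability {major}.{minor}")
    # The GRID K520 on g2 instances reports 3.0, below the 3.5 minimum in the
    # warning above, so the installed PyTorch binaries have no kernels for it.
    if (major, minor) < (3, 5):
        print("This GPU is likely too old for recent PyTorch binaries.")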

crypdick
0

Check the number of classes you assign in the code. This error appeared for me when I tried to run the code on CIFAR-100 instead of CIFAR-10 but forgot to change num_classes from 10 to 100.
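
A cheap guard against that kind of mismatch is to assert that the label range fits the model's output width before training; a sketch with made-up names (the model and targets are placeholders):

import torch
import torch.nn as nn

num_classes = 100  # must match the dataset: CIFAR-100 here, not CIFAR-10

# Hypothetical classifier head; the point is that out_features == num_classes.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, num_classes))

# Fake labels standing in for the dataset's real targets.
targets = torch.randint(0, 100, (256,))
assert int(targets.max()) < num_classes, (
    "Label values exceed the model's number of output classes"
)
logits = model(torch.randn(8, 3, 32, 32))
print(logits.shape)  # torch.Size([8, 100])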

0

A better approach to debugging is to run the computation on the CPU; it will throw an actual error message (see the sketch after this list). In my case:

  1. It could be a class mismatch.
  2. I was running a segmentation model, and my mask had a different number of classes than the class indices predicted by the model.
  3. My transforms on the masks were wrong.
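
As a sketch of that debugging step, moving both the model and one batch to the CPU usually surfaces the underlying error (shape mismatch, bad class index, etc.) instead of the opaque cuDNN message; the toy model below is only a placeholder:

import torch
import torch.nn as nn

# Toy model standing in for the real network; the debugging pattern is what matters.
model = nn.Conv2d(3, 8, kernel_size=3, padding=1)
batch = torch.randn(4, 3, 64, 64)

# Run one forward/backward pass entirely on the CPU to get a readable error.
model_cpu = model.cpu()
out = model_cpu(batch.cpu())
out.sum().backward()
print("CPU pass succeeded; the problem is more likely memory- or driver-related.")
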
Mahmood Hussain
0

First, confirm compatibility between the PyTorch version and the CUDA version. If the versions are correct and compatible, then too high a batch size can also cause this issue. If you load the optimizer on the CPU, the batch size should stay under the threshold of available RAM; for a GPU (CUDA-tensor) optimizer, the batch size needs to be chosen based on the available GPU memory.
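
To check the first point, you can print the versions PyTorch was built against and compare them with what nvidia-smi reports; a sketch (the free-memory query assumes a newer PyTorch release):

import torch

print("PyTorch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    # mem_get_info() exists in newer PyTorch releases; on older installs,
    # nvidia-smi gives the same free/total numbers.
    free, total = torch.cuda.mem_get_info()
    print(f"free: {free / 1024**2:.0f} MiB of {total / 1024**2:.0f} MiB")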

0

There can be several reasons why this error pops up:

  1. PyTorch might not be able to find cuDNN. Check that you are running the correct Python environment (see the check below).
  2. More than one process or script is using the GPU at the same time.
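
To rule out the first point, you can ask PyTorch directly, from the interpreter you are actually running, whether it sees cuDNN; a sketch:

import sys
import torch

# Confirms which interpreter is running and whether that environment's
# PyTorch build can see CUDA and cuDNN at all.
print("python executable:", sys.executable)
print("CUDA available:", torch.cuda.is_available())
print("cuDNN available:", torch.backends.cudnn.is_available())
print("cuDNN version:", torch.backends.cudnn.version())
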
0

Honestly, this usually comes from input/output tensor size mismatches: data of size N has been parameterized as, and is known to the compiler as, size M, so it back-propagates M times in the C++ engine and hits a segmentation fault with no directives or error messages from the PyTorch wrapper.
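
If a size mismatch is suspected, a simple shape assertion before the forward pass (a sketch with placeholder sizes) fails with a clear Python error rather than a low-level crash:

import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3)
x = torch.randn(4, 3, 64, 64)  # placeholder batch

# Fail early, in Python, if the input does not match what the layer expects.
assert x.dim() == 4 and x.shape[1] == conv.in_channels, (
    f"expected (N, {conv.in_channels}, H, W), got {tuple(x.shape)}"
)
print(conv(x).shape)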

nini2352
-1

The problem is that you are using torch.nn.Module for the feed-forward pass but returning with the functional call F.conv2d(). Change your return to use nn.Conv2d() instead.

This will probably help you more: https://pytorch.org/docs/stable/nn.html?highlight=conv2d#torch.nn.Conv2d
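
A minimal sketch of what this answer suggests, declaring the layer as nn.Conv2d in __init__ and calling it in forward (the module and its sizes are made up for illustration):

import torch
import torch.nn as nn

class Discriminator(nn.Module):  # hypothetical minimal module, not the asker's
    def __init__(self):
        super().__init__()
        # The layer owns its weights; nothing is passed to F.conv2d by hand.
        self.conv1 = nn.Conv2d(3, 32, kernel_size=4, stride=2, padding=1)
        self.leaky_relu = nn.LeakyReLU(0.2)

    def forward(self, x):
        return self.leaky_relu(self.conv1(x))

print(Discriminator()(torch.randn(1, 3, 32, 32)).shape)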

Pankaj Mishra