PyTorch Model Training: RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR

Question

After training a PyTorch model on a GPU for several hours, the program fails with the following error:

RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR

Training Conditions

Neural Network: PyTorch 4-layer nn.LSTM with nn.Linear output
Deep Q Network Agent (Vanilla DQN with Replay Memory)
state passed into forward() has the shape (32, 20, 15), where 32 is the batch size
50 seconds per episode
Error occurs after about 583 episodes (8 hours) or 1,150,000 steps, where each step involves a forward pass through the LSTM model.

My code also sets the following values before training begins:

torch.manual_seed(0)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
np.random.seed(0)

How can I troubleshoot this problem? Since it occurred 8 hours into training, any educated guesses would be very helpful!

Thanks!

Update:

Commenting out the 2 torch.backends.cudnn... lines did not work. CUDNN_STATUS_INTERNAL_ERROR still occurs, but much earlier at around Episode 300 (585,000 steps).

torch.manual_seed(0)
#torch.backends.cudnn.deterministic = True
#torch.backends.cudnn.benchmark = False
np.random.seed(0)

System

PyTorch 1.6.0.dev20200525
CUDA 10.2
cuDNN 7604
Python 3.8
Windows 10
NVIDIA GTX 1080 GPU

Error Traceback

RuntimeError                              Traceback (most recent call last)
<ipython-input-18-f5bbb4fdfda5> in <module>
     57 
     58     while not done:
---> 59         action = agent.choose_action(state)
     60         state_, reward, done, info = env.step(action)
     61         score += reward

<ipython-input-11-5ad4dd57b5ad> in choose_action(self, state)
     58         if np.random.random() > self.epsilon:
     59             state = T.tensor([state], dtype=T.float).to(self.q_eval.device)
---> 60             actions = self.q_eval.forward(state)
     61             action = T.argmax(actions).item()
     62         else:

<ipython-input-10-94271a92f66e> in forward(self, state)
     20 
     21     def forward(self, state):
---> 22         lstm, hidden = self.lstm(state)
     23         actions = self.fc1(lstm[:,-1:].squeeze(1))
     24         return actions

~\AppData\Local\Continuum\anaconda3\envs\rl\lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
    575             result = self._slow_forward(*input, **kwargs)
    576         else:
--> 577             result = self.forward(*input, **kwargs)
    578         for hook in self._forward_hooks.values():
    579             hook_result = hook(self, input, result)

~\AppData\Local\Continuum\anaconda3\envs\rl\lib\site-packages\torch\nn\modules\rnn.py in forward(self, input, hx)
    571         self.check_forward_args(input, hx, batch_sizes)
    572         if batch_sizes is None:
--> 573             result = _VF.lstm(input, hx, self._flat_weights, self.bias, self.num_layers,
    574                               self.dropout, self.training, self.bidirectional, self.batch_first)
    575         else:

RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR

Update: I wrapped the failing call in try ... except. In addition to RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR, I now also get a second traceback for RuntimeError: CUDA error: unspecified launch failure

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
<ipython-input-4-e8f15cc8cf4f> in <module>
     61 
     62     while not done:
---> 63         action = agent.choose_action(state)
     64         state_, reward, done, info = env.step(action)
     65         score += reward

<ipython-input-3-1aae79080e99> in choose_action(self, state)
     58         if np.random.random() > self.epsilon:
     59             state = T.tensor([state], dtype=T.float).to(self.q_eval.device)
---> 60             actions = self.q_eval.forward(state)
     61             action = T.argmax(actions).item()
     62         else:

<ipython-input-2-6d22bb632c4c> in forward(self, state)
     25         except Exception as e:
     26             print('error in forward() with state:', state.shape, 'exception:', e)
---> 27             print('state:', state)
     28         actions = self.fc1(lstm[:,-1:].squeeze(1))
     29         return actions

~\AppData\Local\Continuum\anaconda3\envs\rl\lib\site-packages\torch\tensor.py in __repr__(self)
    152     def __repr__(self):
    153         # All strings are unicode in Python 3.
--> 154         return torch._tensor_str._str(self)
    155 
    156     def backward(self, gradient=None, retain_graph=None, create_graph=False):

~\AppData\Local\Continuum\anaconda3\envs\rl\lib\site-packages\torch\_tensor_str.py in _str(self)
    331                 tensor_str = _tensor_str(self.to_dense(), indent)
    332             else:
--> 333                 tensor_str = _tensor_str(self, indent)
    334 
    335     if self.layout != torch.strided:

~\AppData\Local\Continuum\anaconda3\envs\rl\lib\site-packages\torch\_tensor_str.py in _tensor_str(self, indent)
    227     if self.dtype is torch.float16 or self.dtype is torch.bfloat16:
    228         self = self.float()
--> 229     formatter = _Formatter(get_summarized_data(self) if summarize else self)
    230     return _tensor_str_with_formatter(self, indent, formatter, summarize)
    231 

~\AppData\Local\Continuum\anaconda3\envs\rl\lib\site-packages\torch\_tensor_str.py in __init__(self, tensor)
     99 
    100         else:
--> 101             nonzero_finite_vals = torch.masked_select(tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0))
    102 
    103             if nonzero_finite_vals.numel() == 0:

RuntimeError: CUDA error: unspecified launch failure

In my case the error was caused by the labels. The model outputs 53 values, but the dataset's labels did not match the range the cross-entropy loss expects, which starts from 0. Changing this fixed my issue. – Ixtiyor Majidov Aug 07 '21 at 16:45
My problem was actually caused by running out of memory, but I am not sure why this error is raised instead. – Wey Shi Jul 28 '22 at 00:27

score 26 · Answer 1 · answered May 28 '20 at 20:26

The error RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR is notoriously difficult to debug, but surprisingly often it's an out-of-memory problem. Usually you would get an explicit out-of-memory error, but depending on where it occurs, PyTorch cannot intercept the error and therefore cannot provide a meaningful error message.
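One way to test the out-of-memory hypothesis is to log GPU memory at regular intervals during training. The following is a minimal sketch, not code from the question: `gpu_memory_mib` is a hypothetical helper, while `torch.cuda.memory_allocated` and `torch.cuda.max_memory_allocated` are existing PyTorch calls.

```python
import torch


def gpu_memory_mib(device=None):
    """Return (currently allocated, peak allocated) GPU memory in MiB.

    Hypothetical helper, not part of the question's code.
    Returns None when no CUDA device is available.
    """
    if not torch.cuda.is_available():
        return None
    allocated = torch.cuda.memory_allocated(device) / 2**20
    peak = torch.cuda.max_memory_allocated(device) / 2**20
    return allocated, peak


# Example: call this every few thousand steps inside the training loop.
usage = gpu_memory_mib()
if usage is not None:
    print(f"allocated: {usage[0]:.1f} MiB, peak: {usage[1]:.1f} MiB")
```

If the allocated value keeps climbing from episode to episode, that would be consistent with a leak; a flat value makes the memory explanation less likely.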

A memory issue seems likely in your case, because you are using a while loop that runs until the agent is done. If an episode takes long enough, you will run out of memory; it's just a matter of time. This can also occur rather late in training, once the model's parameters in combination with a certain input lead to an episode that does not finish in a reasonable time.

You can avoid that scenario by limiting the number of allowed actions instead of hoping that the actor will be done in a reasonable time.
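Capping the number of actions per episode could look roughly like this. It is a sketch under assumptions: `run_episode`, `max_steps` and `NeverDoneEnv` are hypothetical names, and the environment is assumed to follow the Gym-style `reset()` / `step(action)` interface that the question's traceback suggests.

```python
def run_episode(env, choose_action, max_steps=10_000):
    """Run one episode, stopping after max_steps even if the env is not done.

    `env` is assumed to expose reset() -> state and
    step(action) -> (state, reward, done, info).
    """
    state = env.reset()
    score, steps, done = 0.0, 0, False
    while not done and steps < max_steps:
        action = choose_action(state)
        state, reward, done, info = env.step(action)
        score += reward
        steps += 1
    return score, steps, done


class NeverDoneEnv:
    """Toy environment (hypothetical) that never terminates on its own."""

    def reset(self):
        return 0

    def step(self, action):
        return 0, 1.0, False, {}


# The loop ends after 100 steps although the environment never reports done.
score, steps, done = run_episode(NeverDoneEnv(), lambda state: 0, max_steps=100)
print(score, steps, done)
```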

You also need to be careful not to occupy memory unnecessarily. A common mistake is to keep computing gradients of past states in future iterations. The state from the last iteration should be treated as a constant, since the current action should not affect past actions, so no gradients are required for it. This is usually achieved by detaching the state from the computational graph for the next iteration, e.g. state = state_.detach(). Maybe you are already doing that, but without the code it's impossible to tell.

Similarly, if you keep a history of the states, you should detach them and even more importantly put them on the CPU, i.e. history.append(state.detach().cpu()).
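The two points about detaching and moving stored states to the CPU can be sketched as follows. The tensors are stand-ins, since the question's training loop is not shown; `weights` and `history` are illustrative names only.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for a tensor produced by the model; it is part of the autograd graph.
weights = torch.ones(3, device=device, requires_grad=True)
state_ = weights * 2.0

# Detach before reusing the state in the next iteration,
# so the old graph is not kept alive.
state = state_.detach()

# Store copies on the CPU so the history does not hold GPU memory.
history = []
history.append(state.detach().cpu())

print(state.requires_grad, history[0].device)
```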

Should I interpret this as "Python CUDA libraries are not smart enough to automatically handle memory management and you're running out of GPU RAM because of a memory leak in your program"? – Mikko Rantalainen Dec 19 '22 at 08:20

score 9 · Answer 2 · answered Sep 28 '20 at 12:58

Anyone coming across this error, or other cuDNN/GPU-related errors, should try moving the model and inputs to the CPU. The CPU runtime generally has much better error reporting and will let you debug the issue.

In my experience, the majority of the time the error comes from an invalid index passed to an embedding.
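To illustrate how the CPU surfaces this kind of bug, here is a small sketch with made-up sizes (a 10-row embedding table and one out-of-range index). On the CPU the lookup raises an ordinary Python exception, and a simple range check can be run before each forward pass.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: an embedding table with 10 rows accepts indices 0..9.
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)  # lives on the CPU
indices = torch.tensor([1, 5, 10])  # 10 is out of range

try:
    embedding(indices)
    error = None
except IndexError as exc:
    error = exc
    print("CPU error:", exc)

# A cheap guard that can run before every forward pass.
in_range = bool(((indices >= 0) & (indices < embedding.num_embeddings)).all())
print("indices in range:", in_range)
```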

Rijul Gupta

I never thought this would work, but I actually got it working with this.. apparently I have some wonky cuda installation – dv3 Mar 15 '21 at 15:12
That is exactly what I met most of the time. – Fang WU Aug 10 '21 at 03:49

score 3 · Answer 3 · answered Dec 24 '20 at 20:31

Reducing num_workers worked for me :D

Vortex

This is more of a comment, than an answer. At 50+ reputation, you may post comments. See [here](https://stackoverflow.com/help/privileges/comment) for details. – costaparas Dec 25 '20 at 00:12

score 2 · Answer 4 · answered Aug 09 '20 at 09:29

I ran into the same problem and resolved it by downgrading cudatoolkit to version 10.1, so try reinstalling PyTorch with cudatoolkit 10.1:

conda install pytorch torchvision cudatoolkit=10.1

zxn Z

Did you do this on your own machine, or in Google Colab? – kentropy Aug 05 '21 at 15:06

score 0 · Answer 5 · edited Nov 08 '22 at 14:59

This might not work for everyone, as other factors such as the number of workers or the installed CUDA version could be involved.

For me, a system restart fixed it on my Windows 11 machine with an NVIDIA GeForce RTX 3070 with 8 GB of memory. My machine had been on for days, with many programs starting and stopping work on the GPU.

Jimmy Vlekke

answered Apr 29 '22 at 09:10

Olafenwa Moses

score 0 · Answer 6 · answered Dec 23 '22 at 08:13

I think reducing the batch size will make it work.

I'mD

Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Dec 24 '22 at 08:19

score 0 · Answer 7 · answered Jun 06 '23 at 21:02

For me, the cause was two processes from a previous run that somehow weren't killed properly. They were still occupying two GPUs, which caused the same cuDNN error.

The error disappeared after I killed these two processes.

tjysdsg
