THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524584710464/work/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu line=265 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "main.py", line 109, in <module>
    train(loader_train, model, criterion, optimizer)
  File "main.py", line 54, in train
    optimizer.step()
  File "/usr/local/anaconda35/lib/python3.6/site-packages/torch/optim/sgd.py", line 93, in step
    d_p.add_(weight_decay, p.data)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1524584710464/work/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu:265

How do I resolve this error?

– saichand
  • try running your script with `CUDA_LAUNCH_BLOCKING=1 python your_script.py` to get a more accurate stack trace. – McLawrence Aug 05 '18 at 07:16
  • after running with CUDA_LAUNC...=1, I get the error as `/opt/conda/.../THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [0,0,0] Assertion t >= 0 && t < n_classes failed.` This appears around 20 times, then the traceback follows: `RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1524580978845/work/aten/src/THCUNN/generic/ClassNLLCriterion.cu:116` How do I resolve this? – saichand Aug 05 '18 at 08:00
  • This is an error with your target labels: `t >= 0 && t < n_classes`. Print your labels and make sure that they are positive and smaller than the number of outputs of your last layer. – McLawrence Aug 05 '18 at 08:04
  • n_classes should be the same as the output size of the last layer, is that right? – saichand Aug 05 '18 at 08:11
  • That's right. Your targets likely take on too high values. – McLawrence Aug 05 '18 at 08:16
  • @McLawrence, my error points me to `return self.apply(lambda x: x.to(device), *keys)`. But if I don't use the **to(device)** option, it shows a device mismatch error between CUDA (required for x) and CPU (where x actually is in this case). – Kanishk Mair Feb 28 '20 at 05:40

12 Answers

This is usually an indexing issue.

For example, if your ground truth label starts at 1:

target = [1,2,3,4,5]

Then you should subtract 1 from every label so that:

target = [0,1,2,3,4]
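
A minimal sketch of that shift in PyTorch (assuming the targets are a LongTensor of 1-based class ids; the tensor here is made up):

import torch

target = torch.tensor([1, 2, 3, 4, 5])   # hypothetical 1-based labels
target = target - 1                       # 0-based: tensor([0, 1, 2, 3, 4])
# CrossEntropyLoss / NLLLoss expect class indices in [0, n_classes - 1]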
– Rainy

  • I can confirm, this was also the cause of the error in my case. For example, valid text labels had been converted to 0..n-1 (n being the number of classes). However, NaN values were converted to -1, which sent it off the rails. – Christian Mar 21 '19 at 01:13
  • @Rainy can you elaborate on "ground truth label starts at 1"? What do you mean by that? I gather that the labels are 1 to 5, and that to overcome the error the first value should be zero. Am I right? – Kunj Mehta Oct 02 '19 at 14:55
  • @KunjMehta, it's not just that the first value should be zero. Class indices should start from zero, e.g. for 6 classes the index values should be from 0 to 5. – Chandra Jan 20 '20 at 04:22
  • I get the error even though my setup matches what you describe. – Nihat Nov 20 '20 at 14:25

In general, when encountering CUDA runtime errors, it is advisable to run your program again with the CUDA_LAUNCH_BLOCKING=1 flag to obtain an accurate stack trace.

In your specific case, the targets of your data were too high (or low) for the specified number of classes.
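
For example, a sketch of the two usual ways to enable it (train.py is just a placeholder; the environment variable must be set before CUDA is initialized):

# From the shell:
#   CUDA_LAUNCH_BLOCKING=1 python train.py

# Or inside the script, before the first CUDA call:
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after the variable is set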

– McLawrence

    To add to this, once you get a more accurate stack trace and locate where the issue is, you can move your tensors to CPU. Moving the tensors to CPU will give much more detailed errors. Combining `CUDA_LAUNCH_BLOCKING=1` with moving the tensors to CPU was the only way I was able to solve a problem I spent 3 days on. – Eric Wiener Nov 05 '20 at 01:44
  • How do I run this in a Kaggle kernel? – curiouscheese Jan 09 '23 at 03:36

I encountered this error when running BertModel.from_pretrained('bert-base-uncased'). Moving the model to the CPU changed the error message to 'IndexError: index out of range in self', which led me to this post. The solution was to truncate the input sentences to a length of 512 tokens.
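
A sketch of what that truncation can look like with the Hugging Face tokenizer (the input text is a placeholder):

from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# BERT's position embeddings only cover 512 tokens; longer inputs index
# past that table and trigger the device-side assert on the GPU
inputs = tokenizer("some very long document ...", truncation=True,
                   max_length=512, return_tensors="pt")
outputs = model(**inputs)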

– R Tiffin

  • Good suggestion. The error I got by using CPU as device was very clear; I had written a very basic indexing bug. – Mew May 17 '23 at 14:33

One way to raise the "CUDA error: device-side assert triggered" RuntimeError is to index into a GPU torch.Tensor with a list containing out-of-range indices.

So this snippet raises an IndexError with the message "index 3 is out of bounds for dimension 0 with size 3", not the CUDA error:

import torch

data = torch.randn((3, 10), device=torch.device("cuda"))
data[3, :]    # IndexError: index 3 is out of bounds for dimension 0 with size 3

whereas this one raises the CUDA "device-side assert triggered" RuntimeError:

data = torch.randn((3, 10), device=torch.device("cuda"))
indices = [1, 3]
data[indices, :]    # RuntimeError: CUDA error: device-side assert triggered

In the case of class labels, such as in the answer by @Rainy, it is the final class label (i.e. when label == num_classes) that causes the error when the labels start from 1 rather than 0.

Also, when the device is "cpu", the error raised is an IndexError, like the one raised by the first snippet.
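
A cheap way to catch both situations before the data ever reaches the GPU is a bounds check like this sketch (num_classes and labels stand in for your own values):

import torch

num_classes = 3                       # size of your last layer's output
labels = torch.tensor([0, 1, 2])      # hypothetical targets

# valid class indices are 0 .. num_classes - 1; a 3 or a -1 here would fail
assert labels.min().item() >= 0 and labels.max().item() < num_classes, \
    f"labels must lie in [0, {num_classes - 1}]"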

– Alan

I found that I got this error when I had a label with an invalid value.

– arame3333

This error can be made more informative if you switch to the CPU first. Once you do, it will show the exact error, which is most probably an indexing problem (IndexError: Target 2 is out of bounds in my case, and probably something related in yours). The question to ask is: how many classes are you currently using, and what is the shape of your output? You can find the range of the labels like this:

max(train_labels)
min(train_labels)

which in my case gave me 2 and 0. The problem was caused by the missing index 1, so a quick hack is to replace all 2s with 1s, which can be done with this code:

train_ = train.copy()
train_['label'] = train_['label'].replace(2, 1)

Then run the same code and check the results; it should work:

import torch


class NDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # build one sample from the tokenizer encodings plus its (remapped) label
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)


train_dataset = NDataset(train_encodings, train_labels)
val_dataset = NDataset(val_encodings, val_labels)
test_dataset = NDataset(test_encodings, test_labels)
– Shaina Raza

This occurred for me when the number of input tokens for an instance was greater than the model's maximum, and when the input length was greater than the max_output_length prediction parameter.
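
A sketch of the usual guardrails with a Hugging Face seq2seq model (the checkpoint name and lengths are placeholders; max_output_length above refers to the author's own prediction parameter):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "your-seq2seq-checkpoint"       # placeholder
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

long_text = "..."                            # placeholder input

# keep the input within the model's positional-embedding range
inputs = tokenizer(long_text, truncation=True,
                   max_length=tokenizer.model_max_length,
                   return_tensors="pt")

# and cap the generated output at a length the model supports
summary_ids = model.generate(**inputs, max_length=256)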


Another situation where this can happen: you are training on a dataset with more classes than your last layer expects. It's another out-of-range index situation.
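
A minimal sketch of that sanity check (the layer sizes and the label tensor are made up; if the last layer had fewer outputs than the labels require, you would hit the device-side assert):

import torch
import torch.nn as nn

labels = torch.tensor([0, 1, 2, 3, 4])      # hypothetical dataset labels
n_classes = int(labels.max().item()) + 1     # 5 classes are needed here

head = nn.Linear(128, 5)                     # the model's last layer
assert head.out_features == n_classes, \
    f"last layer outputs {head.out_features} classes, but the labels need {n_classes}"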


This happened to me multiple times when the target (label) passed to the BCE or CE loss was out of the valid range (e.g. negative).

– Valentin

This can also be caused by NaN values in your model's input data. One easy way to "treat" this problem is to convert any NaNs that pop up into zeros on the fly:

batch_data[batch_data != batch_data] = 0    # NaN != NaN, so this selects and zeroes the NaN entries
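
On PyTorch 1.8 or newer, torch.nan_to_num does the same thing in one call (a sketch with a made-up tensor):

import torch

batch_data = torch.tensor([1.0, float("nan"), 3.0])
batch_data = torch.nan_to_num(batch_data, nan=0.0)   # tensor([1., 0., 3.])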
– user2299067

I hope your problem is already solved, but I faced this issue and spent almost two hours on it, so I will explain the problem and the solution here for people like me.
I had this problem because of my class labels.
My project was sentiment analysis with three classes, so I labeled the dataset with the values -1, 0, 1 (3 nodes in the output layer), and that caused my problem!
So I re-labeled the dataset with the values 0, 1, 2 and the problem was solved. It's important to label samples starting at 0 (PyTorch uses the index as the class label, so you should be careful).
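
A sketch of that relabeling, assuming the sentiment labels live in a tensor (a pandas replace would work the same way):

import torch

labels = torch.tensor([-1, 0, 1, 1, -1])   # original labels: -1, 0, 1
labels = labels + 1                        # now 0, 1, 2 (valid class indices)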
For people who get the advice to set CUDA_LAUNCH_BLOCKING = 1: you should run os.environ['CUDA_LAUNCH_BLOCKING'] = "1" before importing PyTorch. If you still face the same error (with no more information about it), run the script on the CPU and try again (this time you will probably get new information about the problem).

– Omid Khalaf Beigi

I got this error when I was using the Hugging Face Transformers model LongformerEncoderDecoder (LED) and setting the decoder length too large. In my case, the default maximum length for the decoder was 1024.

Hope this helps someone

– martin36