I'm following this guide without changing anything. I'm using an AWS server with the Deep Learning AMI (Ubuntu 18.04) Version 40.0.

I've tried swapping my custom dataset for the COCO dataset and for a small subset of the custom one. Batch size doesn't seem to matter, and CUDA and the other drivers seem to work.

The exception is thrown as soon as the first batch starts training. This is the full stack trace:

Logging results to runs/train/exp66
Starting training for 5 epochs...

     Epoch   gpu_mem       box       obj       cls     total   targets  img_size
  0%|                                                                                                                                                                                                                 | 0/22 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 533, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "train.py", line 298, in train
    pred = model(imgs)  # forward
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/yolov5/models/yolo.py", line 121, in forward
    return self.forward_once(x, profile)  # single-scale inference, train
  File "/home/ubuntu/yolov5/models/yolo.py", line 137, in forward_once
    x = m(x)  # run
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/yolov5/models/common.py", line 113, in forward
    return self.conv(torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]], 1))
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/yolov5/models/common.py", line 38, in forward
    return self.act(self.bn(self.conv(x)))
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 399, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 395, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
talonmies
Netanel
  • Not sure if this is helpful, but in my case, I had to reduce the batch size. So it could also be an out of memory situation, see also [RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED using pytorch](https://stackoverflow.com/q/66588715/5193830), specifically [this](https://stackoverflow.com/a/69808279/5193830) answer. – Valentin_Ștefan May 25 '22 at 17:05

3 Answers


I don't know why, but it seems that torch 1.8 is built against an older version of CUDA. Also, since the PyTorch wheels bundle their own CUDA libraries, it doesn't seem to matter which version is installed on your machine. Changing the torch version (and installing a compatible torchvision) solved my problem.
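To see which CUDA and cuDNN versions an installed torch build was compiled against, something like the sketch below may help. The helper name `describe_torch_build` is made up for illustration; it only reads attributes that PyTorch exposes (`torch.__version__`, `torch.version.cuda`, `torch.backends.cudnn.version()`, `torch.cuda.is_available()`) and returns `None` if torch is not installed.

```python
import importlib


def describe_torch_build():
    """Return version info for the installed PyTorch build, or None if torch is missing.

    Hypothetical helper for illustration; not part of YOLOv5 or PyTorch.
    """
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return None
    return {
        "torch": torch.__version__,
        # CUDA version the wheel was built with (None for CPU-only builds)
        "cuda_built_with": torch.version.cuda,
        # cuDNN version visible to torch (None if cuDNN is unavailable)
        "cudnn": torch.backends.cudnn.version(),
        # Whether torch can actually reach a GPU through the driver
        "cuda_available": torch.cuda.is_available(),
    }


print(describe_torch_build())
```

Comparing `cuda_built_with` against the driver version reported by `nvidia-smi` is one way to check whether the wheel and the machine agree.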

In my case I did as follows:

  1. Changed two lines in "requirements.txt":

torch==1.7.1

torchvision==0.8.2

  2. Created a fresh conda environment with python=3.8
  3. Activated the environment
  4. Installed the requirements from the modified file:

$ pip install -r requirements.txt
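If you would rather not edit the file by hand, the pinning step can be sketched in Python. This is only an illustration: the function `pin_requirements` and the sample requirement lines are assumptions, not something from the YOLOv5 repo; the pinned versions are the ones listed above.

```python
import re

# Versions from the steps above; adjust if you pick a different pair.
PINS = {"torch": "1.7.1", "torchvision": "0.8.2"}


def pin_requirements(text, pins=PINS):
    """Rewrite requirement lines for the packages in `pins` as exact pins.

    Hypothetical helper: comments, blank lines, and other packages are kept as-is.
    """
    out = []
    for line in text.splitlines():
        stripped = line.lstrip()
        match = re.match(r"[A-Za-z0-9_.-]+", stripped)
        is_comment = stripped.startswith("#")
        name = match.group(0).lower() if match and not is_comment else None
        out.append(f"{name}=={pins[name]}" if name in pins else line)
    return "\n".join(out)


# Sample input is made up for illustration.
sample = "numpy>=1.18.5\ntorch>=1.7.0\ntorchvision>=0.8.1\n# torch note"
print(pin_requirements(sample))
```

To apply it to a real file, read `requirements.txt`, pass its contents through the function, and write the result back before running `pip install -r requirements.txt`.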

Hope this helps someone :)

Tailwhip

I fixed it using conda: I cloned the PyTorch environment that came with the image, and it works perfectly. I still don't know the cause, though.

Netanel

I ran into something similar when trying to train YOLOv5 in a script. I found that upgrading to torch==1.9.0 and torchvision==0.10.0 also works (in case you don't want to downgrade as mentioned above).

qwertyuiop