
I was running a deep learning program on my Linux server when I suddenly got this error:

UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at /opt/conda/conda-bld/pytorch_1603729096996/work/c10/cuda/CUDAFunctions.cpp:100.)

Earlier, when I had just created this conda environment, torch.cuda.is_available() returned True and I could use CUDA and the GPU. But all of a sudden I could not use CUDA, and torch.cuda.is_available() returned False. What should I do?

P.S. I use a GeForce RTX 3080 with CUDA 11.0 and PyTorch 1.7.0. It worked before, but now it doesn't.
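
For reference, here is a minimal way to inspect the state (a sketch assuming the standard proprietary NVIDIA driver on Linux and the conda environment above):

nvidia-smi                               # should list the RTX 3080 without an NVML error
cat /proc/driver/nvidia/version          # version of the kernel module that is actually loaded
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"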


4 Answers


I just tried rebooting. Problem solved. It turned out to be caused by an NVIDIA NVML driver/library version mismatch.
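
If you want to confirm it is the same mismatch before rebooting, something like this will show it (a sketch assuming the proprietary driver; the package query is Debian/Ubuntu-specific):

nvidia-smi                        # fails with "Driver/library version mismatch" when out of sync
cat /proc/driver/nvidia/version   # driver version of the kernel module currently loaded
dpkg -l | grep nvidia-driver      # driver version of the installed user-space libraries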

    I have the same issue and it disappears consistently when I reboot. However, I would like not to have to reboot. Did you find a solution to this problem? – desmond13 Nov 02 '21 at 10:35
  • Maybe you can try some of the solutions here: https://stackoverflow.com/questions/43022843/nvidia-nvml-driver-library-version-mismatch – maque J Nov 03 '21 at 13:58
  • for me it is not a problem of version mismatch, unfortunately. – desmond13 Nov 04 '21 at 14:12
  • Apt may not automatically change the kernel modules. You may be able to just `modprobe nvidia` to update the drivers in use. – mcint Apr 02 '22 at 00:03

Run nvidia-smi in a different terminal. If you get an error like NVML: Driver/library version mismatch, then follow these steps so you won't have to reboot (the whole sequence is collected into a script sketch after the list):

  1. In a terminal, run: lsmod | grep nvidia.
  2. Then unload the modules that depend on the nvidia driver:
sudo rmmod nvidia_drm
sudo rmmod nvidia_modeset
sudo rmmod nvidia_uvm
  3. Finally, unload the nvidia module itself: sudo rmmod nvidia.
  4. Now lsmod | grep nvidia should produce no output in the terminal.
  5. Run nvidia-smi to check that you get the expected output, and you are good to go!
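
Collected into one script, the whole sequence looks roughly like this (a sketch; it assumes no display server or CUDA process is still holding the modules, otherwise rmmod reports "Module ... is in use"):

#!/usr/bin/env bash
set -e

lsmod | grep nvidia          # see which NVIDIA modules are loaded

sudo rmmod nvidia_drm        # unload the dependent modules first
sudo rmmod nvidia_modeset
sudo rmmod nvidia_uvm
sudo rmmod nvidia            # then the core driver module

lsmod | grep nvidia || true  # should now print nothing

nvidia-smi                   # loads the freshly installed driver and should work again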
  • can you explain why this recipe resolves the issue? – meyerson Jun 21 '22 at 15:23
  • NVML is an API tied directly to various parameters of your GPU hardware, and your NVIDIA driver is built against that hardware. I suspect that when we install a pre-built version of a program such as PyTorch or cudatoolkit, it can end up not matching the driver build installed for the GPU. Cloning PyTorch and building it from source might be a solution, but we don't take on that much hassle unless required. Besides, code can run much faster if PyTorch is built on the local machine! – Satya Prakash Dash Jun 24 '22 at 10:23
  • I get this error when I run `sudo rmmod nvidia_drm`: `rmmod: ERROR: Module nvidia_drm is in use` – Philipp Dec 04 '22 at 22:08
  • The answer to this question https://stackoverflow.com/questions/43022843/nvidia-nvml-driver-library-version-mismatch provides a solution to my problem. – Philipp Dec 04 '22 at 22:16
  • What if `nvidia-smi` functions well? – SnzFor16Min Aug 04 '23 at 17:16

First check the nvidia-fabricmanager service status:

systemctl status nvidia-fabricmanager

If the nvidia-fabricmanager service is in the active (running) state, it is running properly; otherwise, restart it:

systemctl start nvidia-fabricmanager

This works for me!
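
The check and the restart can be combined into one small sketch (assuming nvidia-fabricmanager is installed as a systemd service):

if ! systemctl is-active --quiet nvidia-fabricmanager; then
    sudo systemctl restart nvidia-fabricmanager
fi
systemctl status nvidia-fabricmanager --no-pager   # should now show active (running)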


This is my experience:

  • I had PyTorch 1.12, an NVIDIA GeForce RTX 2080, cuda/11.3.1, and cudnn/8.2.4.15-11.4 on my system, and I got the CUDA initialization error.

  • The error was solved by changing only the cuDNN version, i.e., I used cudnn/8.2.0.53-11.3 and the error was gone (the versions PyTorch actually uses can be checked as shown below).
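
To see which CUDA and cuDNN versions the installed PyTorch build is actually using (a quick check, assuming python is the environment's interpreter):

python -c "import torch; print(torch.version.cuda, torch.backends.cudnn.version())"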
