I got the error "Failed to initialize NVML: Driver/library version mismatch" on a cloud virtual machine for no apparent reason. The system was working normally, then suddenly crashed and reported this error (see the attached screenshot).

I'm very confused and don't know what the cause is. Can someone with experience in this area help? I want to know why I get this error and whether there is any way to prevent it. Thanks.

Quang Vũ
  • 1. Stack Overflow is intended for programming questions rather than driver configuration or installation, so this question is off-topic. You may have more luck on [Super User](https://superuser.com/) or [Server Fault](https://serverfault.com/). 2. Nothing can be read from the image; never upload errors as images, post them as plain text instead – Puteri Jun 29 '23 at 01:07

1 Answer

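According to this doc from the Bright Computing knowledge base, the “Failed to initialize NVML: Driver/library version mismatch” error generally means the CUDA driver is still running an older release that is incompatible with the CUDA toolkit version currently in use. In practice, the NVIDIA kernel module that is loaded comes from a different driver version than the user-space libraries (such as NVML) now installed, which typically happens when driver packages are upgraded without a reboot.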
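
A quick way to check for this kind of mismatch is to compare the version of the NVIDIA kernel module that is currently loaded with the version of the module installed on disk. The following is a minimal sketch and is not taken from the referenced doc. The helper names (loaded_version, installed_version, compare_versions) are made up for illustration, and it assumes the driver exposes its version at /sys/module/nvidia/version and that modinfo is available. The on-disk module version is used as a stand-in for the user-space library version, on the assumption that both were installed by the same package.

```shell
#!/bin/sh
# Sketch: detect a loaded-vs-installed NVIDIA driver version mismatch.
# Helper names are hypothetical; paths assume a typical Linux driver install.

# Version of the kernel module currently loaded (empty if not loaded).
loaded_version() {
    if [ -r /sys/module/nvidia/version ]; then
        cat /sys/module/nvidia/version
    fi
}

# Version of the kernel module installed on disk for the running kernel
# (empty if modinfo or the module is missing).
installed_version() {
    modinfo -F version nvidia 2>/dev/null || true
}

# Classify a pair of version strings: $1 = loaded, $2 = installed.
compare_versions() {
    if [ -z "$1" ]; then
        echo "not-loaded"
    elif [ -z "$2" ]; then
        echo "unknown"
    elif [ "$1" = "$2" ]; then
        echo "match"
    else
        echo "mismatch"
    fi
}

compare_versions "$(loaded_version)" "$(installed_version)"
```

If this prints mismatch, that would suggest the driver packages were upgraded while the old module stayed loaded, which is the case a reboot or a module reload is meant to fix.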

Rebooting the VM is the easiest fix, because it ensures the new driver is loaded and properly initialized after the upgrade.

If you do not want to reboot the VM, you need to unload the existing NVIDIA kernel module and load the new one.

On the VM:

Unload the existing NVIDIA kernel modules, starting with dependent modules such as nvidia_uvm:

modprobe -r nvidia_uvm nvidia

Reload the systemd units:

systemctl daemon-reload

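Build and load the new kernel module (cuda-driver is the service from the Bright Cluster Manager CUDA packages that the doc above assumes; it may not exist on other systems):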

systemctl restart cuda-driver
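
The unload step above fails with a "module is in use" error if anything still holds the NVIDIA modules, so it can help to see which ones are busy first. This is a minimal sketch under stated assumptions: busy_nvidia_modules is a hypothetical helper name, and it relies on the usual lsmod column layout (name, size, use count, users). Other NVIDIA modules, for example nvidia_drm or nvidia_modeset, may also be loaded on some systems.

```shell
#!/bin/sh
# Sketch: list NVIDIA kernel modules that still have users.
# busy_nvidia_modules is a hypothetical helper; it reads lsmod-style text on stdin.
busy_nvidia_modules() {
    # lsmod columns: module name, size, use count, [users]
    awk '$1 ~ /^nvidia/ && $3 > 0 { print $1 }'
}

# Modules printed here cannot be unloaded until their users are gone.
lsmod 2>/dev/null | busy_nvidia_modules
```

A module listed only because another NVIDIA module depends on it can be freed by unloading the dependent module first. If a process is holding the GPU, that process has to be stopped before the unload can succeed.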

If the old NVIDIA kernel module still loads, you may need to delete it from both the node and the software image. You can look for leftover module files with:

find /lib/modules | grep nvidia
find /cm/images/default-image/lib/modules | grep nvidia

To remove all previous CUDA and NVIDIA driver files, refer to this official document; then follow the steps in the CUDA Linux installation guide to reinstall.

Sai Chandini Routhu
  • Thanks for your answer, I managed to fix it this way. However, I still don't understand why the problem occurred: the virtual machine was working properly and I did not update the driver, yet I got the error. Could an automatic update mechanism have caused it? – Quang Vũ Jul 03 '23 at 04:27