I've been running TensorFlow code on an AWS EC2 instance with a Tesla K80 GPU for a while. I have CUDA 9.0 and cuDNN 7.1.4 installed and I'm using TF 1.12, all on Ubuntu 16.04.
Everything worked well until yesterday, but today the NVIDIA driver seems to have stopped running for some reason:
ubuntu@ip-10-0-0-13:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
I checked the drivers:
ubuntu@ip-10-0-0-13:~$ dpkg -l | grep nvidia
rc nvidia-367 367.48-0ubuntu1 amd64 NVIDIA binary driver - version 367.48
ii nvidia-396 396.37-0ubuntu1 amd64 NVIDIA binary driver - version 396.37
ii nvidia-396-dev 396.37-0ubuntu1 amd64 NVIDIA binary Xorg driver development files
ii nvidia-machine-learning-repo-ubuntu1604 1.0.0-1 amd64 nvidia-machine-learning repository configuration files
ii nvidia-modprobe 396.37-0ubuntu1 amd64 Load the NVIDIA kernel driver and create device files
rc nvidia-opencl-icd-367 367.48-0ubuntu1 amd64 NVIDIA OpenCL ICD
ii nvidia-opencl-icd-396 396.37-0ubuntu1 amd64 NVIDIA OpenCL ICD
ii nvidia-prime 0.8.2 amd64 Tools to enable NVIDIA's Prime
ii nvidia-settings 396.37-0ubuntu1 amd64 Tool for configuring the NVIDIA graphics driver
It seems that two different versions are present. Could that be a problem? (I don't see why it would be, since everything worked before.)
Finding this thread, I checked my kernel, which is apparently different from the ones mentioned in the thread:
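For what it's worth, the first column of `dpkg -l` is a status code: `ii` means installed, while `rc` means the package was removed and only its configuration files remain, so the 367 entries above are leftovers rather than a second active driver. A minimal sketch that narrows the listing to installed packages (the function name `installed_nvidia` is only illustrative):

```shell
# Read `dpkg -l` output on stdin and print "name version" for nvidia
# packages whose status is "ii" (installed). Lines marked "rc" are
# removed packages that only left configuration files behind.
installed_nvidia() {
  awk '$1 == "ii" && $2 ~ /^nvidia/ { print $2, $3 }'
}

# Usage on a Debian/Ubuntu machine:
#   dpkg -l | installed_nvidia
```

Taking the listing on stdin keeps the filter usable on saved output as well as on a live system.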
ubuntu@ip-10-0-0-13:~$ uname -a
Linux ip-10-0-0-13 4.4.0-143-generic #169-Ubuntu SMP Thu Feb 7 07:56:38 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
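One way to test whether the kernel is the culprit is to check whether the driver module was built for the running kernel. This is only a sketch, and it assumes the module is managed by DKMS, which is not confirmed for this machine. It also assumes each `dkms status` line names the module and the kernel release it was built for. The helper name `nvidia_built_for` is hypothetical:

```shell
# Succeeds if the `dkms status` text on stdin has an nvidia line that
# mentions the kernel release given as $1.
nvidia_built_for() {
  grep -i 'nvidia' | grep -qF -- "$1"
}

# Usage:
#   if dkms status | nvidia_built_for "$(uname -r)"; then
#     echo "nvidia module is built for the running kernel"
#   else
#     echo "no nvidia module for $(uname -r)"
#   fi
```

If the check fails, a kernel update that left the module unbuilt would be a plausible explanation, though not the only one.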
Has anyone run into this problem and found a way to fix it? Thanks in advance for your help!
EDIT:
When trying to upgrade the drivers with @Dehydrated_Mud's method, I got the following error:
ERROR: The installation was canceled due to the availability or presence of an alternate driver installation. Please see /var/log/nvidia-installer.log for more details.
Here is the content of the log file:
nvidia-installer log file '/var/log/nvidia-installer.log'
creation time: Thu Mar 21 10:56:46 2019
installer version: 384.183
PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin
nvidia-installer command line:
./nvidia-installer
--no-drm
--disable-nouveau
--dkms
--silent
--install-libglvnd
Using built-in stream user interface
-> Detected 4 CPUs online; setting concurrency level to 4.
-> Installing NVIDIA driver version 384.183.
-> The NVIDIA driver appears to have been installed previously using a different installer. To prevent potential conflicts, it is recommended either to update the existing installation using the same mechanism by which it was originally installed, or to uninstall the existing installation before installing this driver.
Please review the message provided by the maintainer of this alternate installation method and decide how to proceed:
The package that is already installed is named nvidia-396.
You can upgrade the driver by running:
`apt-get install nvidia-396 nvidia-modprobe nvidia-settings`
You can remove nvidia-396, and all related packages, by running:
`apt-get remove --purge nvidia-396 nvidia-modprobe nvidia-settings`
This package is maintained by NVIDIA (cudatools@nvidia.com).
(Answer: Abort installation)
ERROR: The installation was canceled due to the availability or presence of an alternate driver installation. Please see /var/log/nvidia-installer.log for more details.
Running `apt-cache search nvidia | grep -P '^nvidia-[0-9]+\s'` gives:
nvidia-331 - Transitional package for nvidia-331
nvidia-346 - Transitional package for nvidia-346
nvidia-304 - NVIDIA legacy binary driver - version 304.135
nvidia-340 - NVIDIA binary driver - version 340.107
nvidia-361 - Transitional package for nvidia-367
nvidia-352 - Transitional package for nvidia-375
nvidia-367 - Transitional package for nvidia-387
nvidia-375 - Transitional package for nvidia-418
nvidia-387 - NVIDIA binary driver - version 387.26
nvidia-418 - NVIDIA binary driver - version 418.39
nvidia-384 - NVIDIA binary driver - version 384.183
nvidia-390 - NVIDIA binary driver - version 390.116
nvidia-410 - NVIDIA binary driver - version 410.104
nvidia-396 - NVIDIA binary driver - version 396.82
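The list above mixes real driver packages with transitional ones. A small sketch, using a hypothetical helper named `real_drivers`, that keeps only the packages shipping an actual driver and shows the version each one provides:

```shell
# Read `apt-cache search` output on stdin, drop transitional packages,
# and print "package version" sorted by the number in the package name.
real_drivers() {
  grep -v 'Transitional package' \
    | awk '{ print $1, $NF }' \
    | sort -t- -k2,2n
}

# Usage:
#   apt-cache search nvidia | grep -P '^nvidia-[0-9]+\s' | real_drivers
```

This makes it easier to see, for example, that the repository offers 396.82 for `nvidia-396` while 396.37 is the version currently installed.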