
I've been using an AWS EC2 instance with a Tesla K80 GPU for a while to run TensorFlow code. I have CUDA 9.0 and cuDNN 7.1.4 installed, and I'm using TF 1.12, all of this on Ubuntu 16.04.

Everything worked well until yesterday, but today it seems that the NVIDIA drivers have stopped running for some reason:

ubuntu@ip-10-0-0-13:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

I checked the drivers:

ubuntu@ip-10-0-0-13:~$ dpkg -l | grep nvidia
rc  nvidia-367                              367.48-0ubuntu1                            amd64        NVIDIA binary driver - version 367.48
ii  nvidia-396                              396.37-0ubuntu1                            amd64        NVIDIA binary driver - version 396.37
ii  nvidia-396-dev                          396.37-0ubuntu1                            amd64        NVIDIA binary Xorg driver development files
ii  nvidia-machine-learning-repo-ubuntu1604 1.0.0-1                                    amd64        nvidia-machine-learning repository configuration files
ii  nvidia-modprobe                         396.37-0ubuntu1                            amd64        Load the NVIDIA kernel driver and create device files
rc  nvidia-opencl-icd-367                   367.48-0ubuntu1                            amd64        NVIDIA OpenCL ICD
ii  nvidia-opencl-icd-396                   396.37-0ubuntu1                            amd64        NVIDIA OpenCL ICD
ii  nvidia-prime                            0.8.2                                      amd64        Tools to enable NVIDIA's Prime
ii  nvidia-settings                         396.37-0ubuntu1                            amd64        Tool for configuring the NVIDIA graphics driver

It seems that there are two different versions present. Could that be a problem? (Though I can't see why, as everything worked before.)

After finding this thread, I checked my kernel, which is apparently different from the ones mentioned there:

ubuntu@ip-10-0-0-13:~$ uname -a
Linux ip-10-0-0-13 4.4.0-143-generic #169-Ubuntu SMP Thu Feb 7 07:56:38 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Has anyone run into this problem and found a way to fix it? Thanks in advance for your help!

EDIT:

When trying to upgrade the drivers with @Dehydrated_Mud's method, I got the following error:

ERROR: The installation was canceled due to the availability or presence of an alternate driver installation. Please see /var/log/nvidia-installer.log for more details.

And the content of the log file:

nvidia-installer log file '/var/log/nvidia-installer.log'
creation time: Thu Mar 21 10:56:46 2019
installer version: 384.183

PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin

nvidia-installer command line:
    ./nvidia-installer
    --no-drm
    --disable-nouveau
    --dkms
    --silent
    --install-libglvnd

Using built-in stream user interface
-> Detected 4 CPUs online; setting concurrency level to 4.
-> Installing NVIDIA driver version 384.183.
-> The NVIDIA driver appears to have been installed previously using a different installer. To prevent potential conflicts, it is recommended either to update the existing installation using the same mechanism by which it was originally installed, or to uninstall the existing installation before installing this driver.

Please review the message provided by the maintainer of this alternate installation method and decide how to proceed:

The package that is already installed is named nvidia-396.

You can upgrade the driver by running:
`apt-get install nvidia-396 nvidia-modprobe nvidia-settings`

You can remove nvidia-396, and all related packages, by running:
`apt-get remove --purge nvidia-396 nvidia-modprobe nvidia-settings`

This package is maintained by NVIDIA (cudatools@nvidia.com).


(Answer: Abort installation)
ERROR: The installation was canceled due to the availability or presence of an alternate driver installation. Please see /var/log/nvidia-installer.log for more details.

Running `apt-cache search nvidia | grep -P '^nvidia-[0-9]+\s'` gives:

nvidia-331 - Transitional package for nvidia-331
nvidia-346 - Transitional package for nvidia-346
nvidia-304 - NVIDIA legacy binary driver - version 304.135
nvidia-340 - NVIDIA binary driver - version 340.107
nvidia-361 - Transitional package for nvidia-367
nvidia-352 - Transitional package for nvidia-375
nvidia-367 - Transitional package for nvidia-387
nvidia-375 - Transitional package for nvidia-418
nvidia-387 - NVIDIA binary driver - version 387.26
nvidia-418 - NVIDIA binary driver - version 418.39
nvidia-384 - NVIDIA binary driver - version 384.183
nvidia-390 - NVIDIA binary driver - version 390.116
nvidia-410 - NVIDIA binary driver - version 410.104
nvidia-396 - NVIDIA binary driver - version 396.82
Alda

5 Answers


I fixed this problem by updating to the latest NVIDIA drivers. Use:

nvcc --version

to get the CUDA toolkit version number. For CUDA 9.0 the latest driver is 384.183, and for CUDA 10.0 it is 410.104.
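
If you want to script that lookup, here is a minimal sketch (my addition, not part of the original answer; the table only covers the driver/toolkit pairs mentioned in this thread):

# Sketch: read the toolkit release from nvcc and map it to the matching
# Tesla driver (mapping limited to the versions discussed here).
cuda_release=$(nvcc --version | grep -oP 'release \K[0-9]+\.[0-9]+')
case "$cuda_release" in
  9.0)  driver=384.183 ;;
  9.1)  driver=390.116 ;;
  10.0) driver=410.104 ;;
  *)    echo "No driver mapping for CUDA $cuda_release" >&2; exit 1 ;;
esac
echo "CUDA $cuda_release -> driver $driver"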

Then run:

 wget http://us.download.nvidia.com/tesla/384.183/NVIDIA-Linux-x86_64-384.183.run

to download the drivers.

Then run:

sudo sh ./NVIDIA-Linux-x86_64-384.183.run --no-drm --disable-nouveau --dkms --silent --install-libglvnd

to install the drivers.

Finally, run:

nvidia-smi

to check if the issue is resolved.
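
As a terser check (a usage sketch of nvidia-smi's query mode, not part of the original answer), you can ask only for the driver version and GPU name:

nvidia-smi --query-gpu=driver_version,name --format=csv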

Dehydrated_Mud
  • Hi. Thanks for your answer, but I ran into an error due to the presence of already-installed drivers: `ERROR: The installation was canceled due to the availability or presence of an alternate driver installation. Please see /var/log/nvidia-installer.log for more details.` Do I need to remove the already-installed driver, and if so, how do I do that? Thanks in advance! – Alda Mar 21 '19 at 10:58
  • Hey. When I tried to run the installation I got this error: `WARNING: One or more modprobe configuration files to disable Nouveau are already present at: /etc/modprobe.d/nvidia-installer-disable-nouveau.conf. Please be sure you have rebooted your system since these files were written....` – Milos Mar 21 '19 at 17:14
  • I encountered that warning as well. The install still went through though, and `nvidia-smi` worked without a hitch. Alda, can you update your question with the relevant lines of the log file, as well as the output of `apt-cache search nvidia | grep -P '^nvidia-[0-9]+\s'`? – Dehydrated_Mud Mar 21 '19 at 22:18
  • Hi. I updated the question with everything. Thanks in advance :) – Alda Mar 22 '19 at 07:57
  • It appears the 396.x installation is blocking the attempted 384.x install. 396 is the latest driver for CUDA toolkit v9.2, not 9.0, so a driver-to-toolkit version mismatch is causing your problems. I recommend removing 396 per the instructions in the log file and installing 384 per my answer. You also have other packages related to 396 on your system; if you purge 396, you may need to re-install some of them. – Dehydrated_Mud Mar 22 '19 at 14:51
  • Did that. Everything seems to work fine now, thanks a lot for your help! :) There's just one thing I'm not sure about: `nvidia-smi` tells me driver version 384.183 is running, so all good there; however, this driver version does not appear in the list displayed by `dpkg -l | grep -i nvidia`. Is that normal? – Alda Mar 24 '19 at 13:37
  • I terminated my EC2 instance and rebooted it and then nvidia-smi stopped working. – Corey Levinson Feb 20 '20 at 23:42

While reinstalling the drivers does get the driver working again, it doesn't address the root cause, so it isn't a complete answer to this problem. I've observed the same issue on Ubuntu: reinstalling the driver was a workaround until the day it broke again. The reason for these spontaneous NVIDIA CUDA driver failures is Ubuntu's automated security updates. When an update rebuilds the kernel, it breaks the CUDA driver and nvidia-smi can no longer communicate with it. A simple solution is to disable automated security updates:

sudo apt -y remove unattended-upgrades
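
If you would rather keep unattended-upgrades installed, a less drastic sketch (my assumption, not part of this answer) is to stop only the automatic kernel updates, since it is the new kernel that invalidates the driver's module:

# Hold the kernel meta-packages so unattended-upgrades can't pull in a new kernel.
sudo apt-mark hold linux-image-generic linux-headers-generic linux-generic

# Alternatively, blacklist kernel packages in /etc/apt/apt.conf.d/50unattended-upgrades:
# Unattended-Upgrade::Package-Blacklist {
#     "linux-image";
#     "linux-headers";
# };

Installing the driver with the --dkms flag, as the other answers do, also helps, since DKMS rebuilds the module for each new kernel.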
mordka
#!/bin/bash
# Download and silently install the NVIDIA Tesla driver whose version is
# given as the first argument, e.g. `sh install.sh 410.104`.

set -x

version=$1
#version=410.79
#version=410.104

wget http://us.download.nvidia.com/tesla/${version}/NVIDIA-Linux-x86_64-${version}.run
sudo sh ./NVIDIA-Linux-x86_64-${version}.run --no-drm --disable-nouveau --dkms --silent --install-libglvnd
  1. Save the above as something like install.sh.
  2. Run it with the driver version as the argument: `sh install.sh 410.104`
  3. Load the kernel module: `sudo modprobe nvidia`

The GPU should be right back; check with `nvidia-smi`.
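
If `nvidia-smi` still fails after the script, two quick checks (my addition, assuming the --dkms install above):

lsmod | grep nvidia   # is the kernel module loaded at all?
dkms status           # was the module built for the running kernel?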


This worked for me:

sudo apt purge nvidia-driver-450
sudo apt autoremove
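
The package name differs between setups (nvidia-driver-450 here, nvidia-396 in the question), so a more general sketch (my addition) is to list what is installed first and purge exactly those packages:

# List the installed NVIDIA driver packages, then purge the ones it prints.
dpkg -l | awk '/^ii +nvidia-/ {print $2}'
# e.g. for the packages shown in the question:
# sudo apt purge nvidia-396 nvidia-396-dev nvidia-opencl-icd-396
sudo apt autoremove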
Deepak C U

For multi-CUDA installations, choose the CUDA versions that you intend to use, then install them in order from earliest to latest. For CUDA 9.0 the latest driver is 384.183, for 9.1 it is 390.116, and for CUDA 10.0 it is 410.104.

You can find the driver file names on the following website, but don't use the .deb files:

https://www.nvidia.com/Download/Find.aspx

$ cd /usr/local
$ sudo rm cuda
$ sudo ln -s cuda-${cuda_version} cuda

$ wget http://us.download.nvidia.com/tesla/${nvidia_version}/NVIDIA-Linux-x86_64-${nvidia_version}.run
$ sudo sh ./NVIDIA-Linux-x86_64-${nvidia_version}.run --no-drm --disable-nouveau --dkms --silent --install-libglvnd
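
A usage sketch of the symlink switch above (my addition; cuda_version and nvidia_version are placeholders you set yourself, as in the answer):

$ export PATH=/usr/local/cuda/bin:$PATH
$ export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
$ ls -l /usr/local/cuda    # should point at the cuda-<version> you linked
$ nvcc --version           # should report that toolkit's release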

rigo
  • Then when I tried `nvidia-smi` it said "Killed", so I tried to run it again and the EC2 instance hung; even Ctrl+C couldn't escape. – Corey Levinson Feb 20 '20 at 23:45
  • Did you get this to work? If so, edit my answer and I'll accept the edit. Or write your own answer and I'll remove mine. I know I have to redo this answer, because it should be something like: 1. purge the NVIDIA drivers, then 2. install the drivers. – rigo Feb 23 '20 at 02:52
  • Actually, after a couple of hours of toying with it on EC2, I gave up. I'm just going to go with the preinstalled CUDA 10 drivers from the AWS Marketplace. Your answer kind of worked (after I rebooted the machine when it hung). My problem was that my PyTorch installation wanted CUDA 10+ and I was on CUDA 9.1... I just wanted to cry when I found that out. And the torch cuda90 installation wasn't fixing anything. – Corey Levinson Feb 23 '20 at 17:51