What could cause a GPU to stop working on a Google Cloud VM?

Question

I use a VM with TensorFlow on Google Cloud.

The VM was created using the official Google image https://console.cloud.google.com/marketplace/product/click-to-deploy-images/deeplearning?_ga=2.148488823.1903313271.1624440425-168625328.1576904373

It worked for a few months, but today I suddenly started getting the error "failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected"

What could have caused this change?
Is it possible that my VM was updated, or is there a problem with Google Cloud?

TF version: 2.4.1
GPU: 1 x NVIDIA Tesla T4

Update
My VM received an update, which probably caused the problem.
Any advice on which drivers I need to reinstall?

Start-Date: 2021-06-24  05:28:25
Commandline: /usr/bin/unattended-upgrade
Upgrade: linux-compiler-gcc-8-x86:amd64 (4.19.181-1, 4.19.194-1)
End-Date: 2021-06-24  05:28:26

Start-Date: 2021-06-24  05:28:27
Commandline: /usr/bin/unattended-upgrade
Upgrade: libhogweed4:amd64 (3.4.1-1, 3.4.1-1+deb10u1), libnettle6:amd64 (3.4.1-1, 3.4.1-1+deb10u1)
End-Date: 2021-06-24  05:28:27

Start-Date: 2021-06-24  05:28:28
Commandline: /usr/bin/unattended-upgrade
Upgrade: shim-helpers-amd64-signed:amd64 (1+15+1533136590.3beb971+7+deb10u1, 1+15.4+5~deb10u1), shim-unsigned:amd64 (15+1533136590.3beb971-7+deb10u1, 15.4-5~deb10u1), shim-signed:amd64 (1.33+15+1533136590.3beb971-7, 1.36~1+deb10u1+15.4-5~deb10u1), shim-signed-common:amd64 (1.33+15+1533136590.3beb971-7, 1.36~1+deb10u1+15.4-5~deb10u1)
End-Date: 2021-06-24  05:28:32

Start-Date: 2021-06-24  05:28:33
Commandline: /usr/bin/unattended-upgrade
Upgrade: base-files:amd64 (10.3+deb10u9, 10.3+deb10u10)
End-Date: 2021-06-24  05:28:33

Start-Date: 2021-06-24  05:28:34
Commandline: /usr/bin/unattended-upgrade
Upgrade: libglib2.0-0:amd64 (2.58.3-2+deb10u2, 2.58.3-2+deb10u3)
End-Date: 2021-06-24  05:28:34

Start-Date: 2021-06-24  05:28:35
Commandline: /usr/bin/unattended-upgrade
Upgrade: libklibc:amd64 (2.0.6-1, 2.0.6-1+deb10u1), klibc-utils:amd64 (2.0.6-1, 2.0.6-1+deb10u1)
End-Date: 2021-06-24  05:28:35

Start-Date: 2021-06-24  05:28:36
Commandline: /usr/bin/unattended-upgrade
Install: linux-image-4.19.0-17-cloud-amd64:amd64 (4.19.194-1, automatic)
Upgrade: linux-image-cloud-amd64:amd64 (4.19+105+deb10u11, 4.19+105+deb10u12)
End-Date: 2021-06-24  05:28:45

Start-Date: 2021-06-24  05:28:45
Commandline: /usr/bin/unattended-upgrade
Upgrade: linux-libc-dev:amd64 (4.19.181-1, 4.19.194-1)
End-Date: 2021-06-24  05:28:46

Start-Date: 2021-06-24  05:28:47
Commandline: /usr/bin/unattended-upgrade
Upgrade: isc-dhcp-client:amd64 (4.4.1-2, 4.4.1-2+deb10u1)
End-Date: 2021-06-24  05:28:47

Start-Date: 2021-06-24  05:28:48
Commandline: /usr/bin/unattended-upgrade
Upgrade: libxml2:amd64 (2.9.4+dfsg1-7+deb10u1, 2.9.4+dfsg1-7+deb10u2)
End-Date: 2021-06-24  05:28:48

Start-Date: 2021-06-24  05:28:49
Commandline: /usr/bin/unattended-upgrade
Upgrade: libgcrypt20:amd64 (1.8.4-5, 1.8.4-5+deb10u1)
End-Date: 2021-06-24  05:28:49

Start-Date: 2021-06-24  05:28:50
Commandline: /usr/bin/unattended-upgrade
Upgrade: linux-kbuild-4.19:amd64 (4.19.181-1, 4.19.194-1)
End-Date: 2021-06-24  05:28:50

Start-Date: 2021-06-24  05:28:51
Commandline: /usr/bin/unattended-upgrade
Upgrade: libgnutls30:amd64 (3.6.7-4+deb10u6, 3.6.7-4+deb10u7)
End-Date: 2021-06-24  05:28:51
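
To see which of these upgrades touch the kernel (and could therefore affect a driver module built for the previous kernel), the log can be filtered for kernel-related packages. Below is a minimal sketch, assuming the text is in the format shown above (on Debian it is normally found in /var/log/apt/history.log). The function name `kernel_changes` and the prefix list are illustrative choices for this sketch, not part of any existing tool.

```python
import re

# Package-name prefixes treated as kernel-related (an assumption for this sketch).
KERNEL_PREFIXES = (
    "linux-image",
    "linux-headers",
    "linux-kbuild",
    "linux-compiler",
    "linux-libc-dev",
)

# Matches one entry such as "name:arch (old, new)" in an Install:/Upgrade: line.
ENTRY_RE = re.compile(r"([^\s:,]+):(\S+) \(([^)]*)\)")


def kernel_changes(log_text):
    """Return (start_date, action, package, versions) for kernel-related entries."""
    changes = []
    start_date = None
    for line in log_text.splitlines():
        if line.startswith("Start-Date:"):
            start_date = line.split(":", 1)[1].strip()
        elif line.startswith(("Install:", "Upgrade:")):
            action, _, rest = line.partition(":")
            for name, _arch, versions in ENTRY_RE.findall(rest):
                if name.startswith(KERNEL_PREFIXES):
                    changes.append((start_date, action, name, versions))
    return changes


# Two transactions copied from the log above.
SAMPLE = """\
Start-Date: 2021-06-24  05:28:34
Commandline: /usr/bin/unattended-upgrade
Upgrade: libglib2.0-0:amd64 (2.58.3-2+deb10u2, 2.58.3-2+deb10u3)
End-Date: 2021-06-24  05:28:34

Start-Date: 2021-06-24  05:28:36
Commandline: /usr/bin/unattended-upgrade
Install: linux-image-4.19.0-17-cloud-amd64:amd64 (4.19.194-1, automatic)
Upgrade: linux-image-cloud-amd64:amd64 (4.19+105+deb10u11, 4.19+105+deb10u12)
End-Date: 2021-06-24  05:28:45
"""

for start, action, package, versions in kernel_changes(SAMPLE):
    print(f"{start}  {action:<8} {package} ({versions})")
```

An "Install" entry for a linux-image package means a new kernel is used from the next boot, and a driver module built only for the old kernel may then fail to load. The excerpt above does not list a matching linux-headers package, which may matter if the driver has to be rebuilt.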

Update 2
I tried to update the NVIDIA drivers using

sudo /opt/deeplearning/install-driver.sh

Now I am getting the error
cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error

Running nvidia-smi yields

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   73C    P0    24W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

and nvcc --version gives

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Thu_Jun_11_22:26:38_PDT_2020
Cuda compilation tools, release 11.0, V11.0.194
Build cuda_11.0_bu.TC445_37.28540450_0
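
Both outputs report a CUDA version, so a quick consistency check is possible. The sketch below pulls the versions out of text in the format shown above. The function names and regular expressions are illustrative assumptions fitted to this particular output, not an official parser for either tool.

```python
import re


def smi_versions(smi_text):
    """Return (driver_version, cuda_version) from an nvidia-smi header, or None."""
    m = re.search(
        r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)", smi_text
    )
    return (m.group(1), m.group(2)) if m else None


def nvcc_release(nvcc_text):
    """Return the toolkit release (e.g. '11.0') from nvcc --version output, or None."""
    m = re.search(r"release\s+([\d.]+)", nvcc_text)
    return m.group(1) if m else None


def as_tuple(version):
    """Turn '11.0' into (11, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


# Lines copied from the outputs above.
SMI_HEADER = "| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |"
NVCC_LINE = "Cuda compilation tools, release 11.0, V11.0.194"

driver, driver_cuda = smi_versions(SMI_HEADER)
toolkit = nvcc_release(NVCC_LINE)

# The CUDA version printed by nvidia-smi is the newest one the driver supports,
# so the toolkit release should not be higher than it.
print("driver:", driver)
print("toolkit within driver support:", as_tuple(toolkit) <= as_tuple(driver_cuda))
```

For the output above, the driver and the toolkit both report CUDA 11.0, so a toolkit/driver version mismatch does not look like the obvious cause.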

Does this answer your question? [TensorFlow : failed call to cuInit: CUDA\_ERROR\_NO\_DEVICE](https://stackoverflow.com/questions/48658204/tensorflow-failed-call-to-cuinit-cuda-error-no-device) - yudhiesh, Jun 24 '21 at 07:59
Did you follow the [documentation](https://cloud.google.com/deep-learning-vm/docs/tensorflow_start_instance) when setting up your VM? - Alexandre Moraes, Jun 24 '21 at 13:04
Yes, and the VM worked for several months without any problem. I actually have 3 different VMs in different regions, all with the same problem. - shasho, Jun 24 '21 at 14:04
There is currently a public issue open in Google's Issue Tracker. You can follow the thread [here](https://issuetracker.google.com/issues/191612865) and comment there that you are also affected. Click the "star" icon to indicate that you are affected by the issue. - Alexandre Moraes, Jun 30 '21 at 09:18
