I have been running a Compute Engine VM with a T4 GPU on COS for quite some time and it worked fine, but recently cos-extensions install gpu stopped working:

I0830 07:32:58.419130     987 main.go:21] Checking if this is the only cos_gpu_installer that is running.
I0830 07:32:58.427417     987 install.go:74] Running on COS build id 16108.470.16
I0830 07:32:58.427566     987 installer.go:187] Getting the default GPU driver version
I0830 07:32:58.427911     987 utils.go:72] Downloading gpu_default_version from https://storage.googleapis.com/cos-tools/16108.470.16/gpu_default_version
I0830 07:32:58.548403     987 utils.go:120] Successfully downloaded gpu_default_version from https://storage.googleapis.com/cos-tools/16108.470.16/gpu_default_version
I0830 07:32:58.548594     987 install.go:85] Installing GPU driver version 450.119.04
I0830 07:32:58.549646     987 cache.go:72] map[BUILD_ID:16108.470.11 DRIVER_VERSION:450.119.04]
I0830 07:32:58.549674     987 install.go:120] Did not find cached version, installing the drivers...
I0830 07:32:58.549681     987 installer.go:82] Configuring driver installation directories
I0830 07:32:58.563327     987 installer.go:196] Updating container's ld cache
I0830 07:32:58.793692     987 signature.go:30] Downloading driver signature for version 450.119.04
I0830 07:32:58.793721     987 utils.go:72] Downloading 450.119.04.signature.tar.gz from https://storage.googleapis.com/cos-tools/16108.470.16/extensions/gpu/450.119.04.signature.tar.gz
E0830 07:32:58.828902     987 artifacts.go:106] Failed to download extensions/gpu/450.119.04.signature.tar.gz from public GCS: failed to download 450.119.04.signature.tar.gz, status: 404 Not Found
E0830 07:32:58.829401     987 install.go:175] failed to download driver signature: failed to download driver signature for version 450.119.04: failed to download extensions/gpu/450.119.04.signature.tar.gz

It seems the installer could not find the driver signature. I looked into this and tried the following workaround:

/usr/bin/docker run --rm \
    --privileged \
    --net=host \
    --pid=host \
    --volume /dev:/dev \
    --volume /:/root \
    --volume /var/lib/toolbox/nvidia:/usr/local/nvidia \
    --env NVIDIA_DRIVER_VERSION=450.119.04 \
    gcr.io/cos-cloud/cos-gpu-installer:latest

but got this instead:

+ COS_KERNEL_INFO_FILENAME=kernel_info
+ COS_KERNEL_SRC_HEADER=kernel-headers.tgz
+ TOOLCHAIN_URL_FILENAME=toolchain_url
+ TOOLCHAIN_ENV_FILENAME=toolchain_env
+ TOOLCHAIN_PKG_DIR=/build/cos-tools
+ CHROMIUMOS_SDK_GCS=https://storage.googleapis.com/chromiumos-sdk
+ ROOT_OS_RELEASE=/root/etc/os-release
+ KERNEL_SRC_HEADER=/build/usr/src/linux
+ NVIDIA_DRIVER_VERSION=450.119.04
+ NVIDIA_DRIVER_MD5SUM=
+ NVIDIA_INSTALL_DIR_HOST=/var/lib/nvidia
+ NVIDIA_INSTALL_DIR_CONTAINER=/usr/local/nvidia
+ ROOT_MOUNT_DIR=/root
+ CACHE_FILE=/usr/local/nvidia/.cache
+ LOCK_FILE=/root/tmp/cos_gpu_installer_lock
+ LOCK_FILE_FD=20
+ set +x
[INFO    2021-08-30 07:36:38 UTC] PRELOAD: false
[INFO    2021-08-30 07:36:38 UTC] Running on COS build id 16108.470.16
[INFO    2021-08-30 07:36:38 UTC] Data dependencies (e.g. kernel source) will be fetched from https://storage.googleapis.com/cos-tools/16108.470.16
[INFO    2021-08-30 07:36:38 UTC] Checking if this is the only cos-gpu-installer that is running.
[INFO    2021-08-30 07:36:38 UTC] Checking if third party kernel modules can be installed
/tmp/esp /
/
[INFO    2021-08-30 07:36:38 UTC] Checking cached version
/entrypoint.sh: line 172: CACHE_BUILD_ID: unbound variable

It seems some changes are going on with COS and the COS GPU driver (maybe?). Is there a workaround for this problem, apart from waiting for GCP to sort things out?

Phumin W.
  • I'm running into this very issue myself right now, no answer yet except for a related issue here: https://stackoverflow.com/questions/68030764/cos-extensions-install-gpu-failed-to-download-driver-signature-on-gcp-compute-en – jlos Aug 30 '21 at 19:59
  • Do you get anything under `[gpu]` when running `cos-extensions list`? I got nothing (no driver versions listed) – Phumin W. Aug 31 '21 at 01:40
  • Nothing :/ Which is weird because the other workarounds do have something listed... I tried to sign up for GCloud support but I need to go through an entire "Organization" process in order to get that set up, which for some reason is a big hassle. – jlos Aug 31 '21 at 10:46

1 Answer

This is the same case as the one Jan Vansteenlandt linked to.

This happens in some versions of COS, for example the latest stable version available now, 89-16108:

vm-16108 ~ # cos-extensions list
Available extensions for COS version 89-16108.470.16:

[gpu]

No driver is listed under [gpu], and running cos-extensions install gpu fails the same way as in your case. Running the docker container you mentioned yielded the same result.
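The 404 from the question's log can also be checked without running the installer. The sketch below rebuilds the signature URL from the pattern visible in that log and asks GCS whether the object exists. The function names signature_url and signature_available are made up for illustration, and the URL layout is assumed from the log output rather than from any documented contract.

```shell
#!/bin/sh
# Build the signature URL the installer requested in the log above.
# Assumption: the path layout is copied from that log line and may change.
signature_url() {
    build_id="$1"        # e.g. 16108.470.16
    driver_version="$2"  # e.g. 450.119.04
    printf 'https://storage.googleapis.com/cos-tools/%s/extensions/gpu/%s.signature.tar.gz\n' \
        "$build_id" "$driver_version"
}

# Return 0 if the signature object exists (HTTP 200), non-zero otherwise.
# Sends a HEAD request only, so nothing is downloaded.
signature_available() {
    status=$(curl -s -o /dev/null -I -w '%{http_code}' "$(signature_url "$1" "$2")")
    [ "$status" = "200" ]
}
```

For example, signature_available 16108.470.16 450.119.04 && echo present || echo missing reports whether the artifact has been published for that build.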

This is a known issue and has already been raised on IssueTracker. You can follow the link and click the +1 button; you can also comment and post your own findings in the thread.

There's also a workaround in the thread so you may give it a go.

If you can use an older version of COS (85-13310, for example), the driver is listed:

vm-13310 ~ # cos-extensions list
Available extensions for COS version 85-13310.1308.10:

[gpu]
450.119.04 [default]
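To tell the two cases apart in a script, the [gpu] section can be parsed. This is a minimal sketch that assumes the output format stays as in the two listings above; gpu_versions and has_gpu_driver are hypothetical helper names, not part of cos-extensions.

```shell
#!/bin/sh
# Print the driver versions listed under [gpu] in `cos-extensions list`
# output read from stdin. Assumes the layout shown in this thread.
gpu_versions() {
    awk '
        $1 ~ /^\[.*\]$/ { in_gpu = ($1 == "[gpu]"); next }  # section header line
        in_gpu && NF    { print $1 }                        # first field is the version
    '
}

# Return 0 if at least one driver version is listed under [gpu].
has_gpu_driver() {
    [ -n "$(gpu_versions)" ]
}
```

On a VM this would be used as cos-extensions list | has_gpu_driver || echo "no GPU driver published for this build".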

Running cos-extensions install gpu on that version results in a successful installation of the NVIDIA drivers:


vm-13310 ~ # cos-extensions install gpu
I0831 14:25:11.405591    1168 main.go:21] Checking if this is the only cos_gpu_installer that is running.
I0831 14:25:11.407510    1168 install.go:74] Running on COS build id 13310.1308.10
I0831 14:25:11.407519    1168 installer.go:187] Getting the default GPU driver version
I0831 14:25:11.407581    1168 utils.go:72] Downloading gpu_default_version from https://storage.googleapis.com/cos-tools/13310.1308.10/gpu_default_version
I0831 14:25:11.448046    1168 utils.go:120] Successfully downloaded gpu_default_version from https://storage.googleapis.com/cos-tools/13310.1308.10/gpu_default_version
I0831 14:25:11.448539    1168 install.go:85] Installing GPU driver version 450.119.04
I0831 14:25:11.448751    1168 cache.go:69] error: failed to read file /root/var/lib/nvidia/.cache: open /root/var/lib/nvidia/.cache: no such file or directory
I0831 14:25:11.448942    1168 install.go:120] Did not find cached version, installing the drivers...
I0831 14:25:11.449084    1168 installer.go:82] Configuring driver installation directories
I0831 14:25:11.469718    1168 installer.go:196] Updating container's ld cache
I0831 14:25:11.480682    1168 signature.go:30] Downloading driver signature for version 450.119.04
I0831 14:25:11.481007    1168 utils.go:72] Downloading 450.119.04.signature.tar.gz from https://storage.googleapis.com/cos-tools/13310.1308.10/extensions/gpu/450.119.04.signature.tar.gz
I0831 14:25:11.506186    1168 utils.go:120] Successfully downloaded 450.119.04.signature.tar.gz from https://storage.googleapis.com/cos-tools/13310.1308.10/extensions/gpu/450.119.04.signature.tar.gz
I0831 14:25:11.506541    1168 signature.go:37] Decompressing signature /build/sign-gpu-driver/450.119.04.signature.tar.gz
I0831 14:25:11.510104    1168 installer.go:68] Downloading GPU driver installer version 450.119.04
I0831 14:25:11.511637    1168 utils.go:72] Downloading GPU driver installer from https://storage.googleapis.com/nvidia-drivers-us-public/nvidia-cos-project/85/tesla/450_00/450.119.04/NVIDIA-Linux-x86_64-450.119.04_85-13310-1308-10.cos
I0831 14:25:12.885856    1168 utils.go:120] Successfully downloaded GPU driver installer from https://storage.googleapis.com/nvidia-drivers-us-public/nvidia-cos-project/85/tesla/450_00/450.119.04/NVIDIA-Linux-x86_64-450.119.04_85-13310-1308-10.cos

-----  removed some lines for better readability  -----

I0831 14:28:49.433597    1168 cache.go:58] Updated cached version as
I0831 14:28:49.498379    1168 cache.go:60] BUILD_ID=13310.1308.10
I0831 14:28:49.498560    1168 cache.go:60] DRIVER_VERSION=450.119.04
I0831 14:28:49.498694    1168 installer.go:32] Verifying GPU driver installation
I0831 14:28:50.309502    1168 utils.go:334] Tue Aug 31 14:28:50 2021       
I0831 14:28:50.309879    1168 utils.go:334] +-----------------------------------------------------------------------------+
I0831 14:28:50.311093    1168 utils.go:334] | NVIDIA-SMI 450.119.04   Driver Version: 450.119.04   CUDA Version: 11.0     |
I0831 14:28:50.311300    1168 utils.go:334] |-------------------------------+----------------------+----------------------+
I0831 14:28:50.311497    1168 utils.go:334] | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
I0831 14:28:50.311640    1168 utils.go:334] | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
I0831 14:28:50.311784    1168 utils.go:334] |                               |                      |               MIG M. |
I0831 14:28:50.311949    1168 utils.go:334] |===============================+======================+======================|
I0831 14:28:50.322257    1168 utils.go:334] |   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
I0831 14:28:50.322566    1168 utils.go:334] | N/A   76C    P0    27W /  70W |      0MiB / 15109MiB |      0%      Default |
I0831 14:28:50.322708    1168 utils.go:334] |                               |                      |                  N/A |
I0831 14:28:50.322878    1168 utils.go:334] +-------------------------------+----------------------+----------------------+
I0831 14:28:50.323119    1168 utils.go:334]                                                                                
I0831 14:28:50.323293    1168 utils.go:334] +-----------------------------------------------------------------------------+
I0831 14:28:50.323431    1168 utils.go:334] | Processes:                                                                  |
I0831 14:28:50.323597    1168 utils.go:334] |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
I0831 14:28:50.323715    1168 utils.go:334] |        ID   ID                                                   Usage      |
I0831 14:28:50.323863    1168 utils.go:334] |=============================================================================|
I0831 14:28:50.324222    1168 utils.go:334] |  No running processes found                                                 |
I0831 14:28:50.324439    1168 utils.go:334] +-----------------------------------------------------------------------------+
I0831 14:28:50.465730    1168 modules.go:48] Updating host's ld cache
I0831 14:28:52.305122    1168 install.go:167] Finished installing the drivers.
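In the log from the question, the cache recorded BUILD_ID 16108.470.11 while the VM was running build 16108.470.16, which appears to be why the installer went on to reinstall the driver. The sketch below compares the two values so a stale cache can be spotted before a workload needs the GPU. It assumes the cache is a plain KEY=VALUE file at /var/lib/nvidia/.cache on the host (inferred from the log lines above) and that /etc/os-release carries a BUILD_ID entry; read_key and driver_cache_current are hypothetical names.

```shell
#!/bin/sh
# Print the value of KEY from a KEY=VALUE file, with optional quotes stripped.
# $1 = key, $2 = file
read_key() {
    sed -n "s/^$1=//p" "$2" | head -n 1 | tr -d '"'
}

# Return 0 when the cached driver was installed for the running COS build.
# Both paths are assumptions and can be overridden:
# $1 = os-release file, $2 = installer cache file
driver_cache_current() {
    os_release="${1:-/etc/os-release}"
    cache_file="${2:-/var/lib/nvidia/.cache}"
    [ -r "$cache_file" ] || return 1          # no cache means no installed driver
    running=$(read_key BUILD_ID "$os_release")
    cached=$(read_key BUILD_ID "$cache_file")
    [ -n "$running" ] && [ "$running" = "$cached" ]
}
```

Calling driver_cache_current || echo "driver needs reinstalling for this build" after a COS update gives an early signal that cos-extensions install gpu will run again on that VM.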

Wojtek_B
  • Thank you for your response. I just checked my VMs again and it has been working fine since around Aug 31 8pm UTC without any workaround, but thanks again anyway. GCP has fixed something on their side, I guess (although it took quite some time :( ) – Phumin W. Aug 31 '21 at 23:25