How do I find the driver version of the node in Autopilot?

I need the 525 driver version on the node - but I suspect it's 470.

Is there a way to specify a nodeSelector to provision nodes with 525 version of the driver?

GRS

1 Answer

In Autopilot clusters, GKE manages GPU driver version selection and installation. If you need the list of GPU driver versions associated with a GKE version, refer to the corresponding Container-Optimized OS page linked in the GKE current versions table.

For example, if you have selected GKE version 1.25.7-gke.1000, the COS version is cos-101-17162-127-27, and the supported GPU driver versions are v470.182.03 (default) and v525.105.17.
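To see which of those driver versions a running workload actually gets, one option is to ask `nvidia-smi` from inside a GPU pod. This is only a sketch: `gpu_driver_version` is a hypothetical helper name, and it assumes the pod is already running with a GPU attached and that `nvidia-smi` is on the container's `PATH` (on GKE the NVIDIA utilities are normally mounted under `/usr/local/nvidia/bin`, so you may need the full path instead).

```shell
# Hypothetical helper: print the NVIDIA driver version seen by a running GPU pod.
# Usage: gpu_driver_version <pod-name> [namespace]
# Assumes nvidia-smi is on PATH inside the container; if it is not, try
# /usr/local/nvidia/bin/nvidia-smi instead.
gpu_driver_version() {
  pod="$1"
  namespace="${2:-default}"
  kubectl exec "$pod" --namespace "$namespace" -- \
    nvidia-smi --query-gpu=driver_version --format=csv,noheader
}
```

Calling `gpu_driver_version my-gpu-pod` (with your own pod name) should print one version string per GPU visible to the pod.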

You can follow this documentation to deploy your GPU workloads on an Autopilot cluster.
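As a rough illustration of what such a workload looks like, here is a minimal Pod manifest sketch. The pod name, container name, and image are placeholders chosen for this example, not something from the question. Note that the `nodeSelector` picks the GPU type, not the driver version; as described above, Autopilot chooses the driver itself.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-driver-check          # hypothetical name
spec:
  nodeSelector:
    # Selects the GPU model; this does not select a driver version.
    cloud.google.com/gke-accelerator: nvidia-tesla-t4
  containers:
  - name: cuda                    # hypothetical name
    image: nvidia/cuda:11.0.3-runtime-ubuntu20.04   # example image, swap in your own
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1         # request one GPU
```

Once the pod is running, you can exec into it to inspect the driver the node was provisioned with.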

Edit 1: The paragraph below applies to Standard clusters only.

After adding GPU nodes to your cluster, you need to install NVIDIA's device drivers on the nodes. Google provides a DaemonSet that you can apply to install the drivers. On GPU nodes that use Container-Optimized OS images, you also have the option of selecting between the default GPU driver version and a newer version.

Note: This content is taken from the official Google Cloud documentation, links to which are embedded in the text.

  • From the DaemonSet, it seems that this is for GKE Standard, and not Autopilot nodes. Is that not the case? I'm already on `1.25.7-gke.1000`; is there a way to check which driver the node has installed? – GRS Apr 25 '23 at 14:52
  • @GRS The DaemonSet is the same on both Standard and Autopilot nodes. The only difference is that on Autopilot nodes driver updates are managed by Google, whereas on Standard nodes you have to handle them yourself. Autopilot nodes only support `nvidia-tesla-t4` and `nvidia-tesla-a100` as of now. Refer to this [doc](https://cloud.google.com/kubernetes-engine/docs/how-to/autopilot-gpus) for more information. – Kranthiveer Dontineni Apr 26 '23 at 11:54
  • When running `kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml` on an Autopilot cluster, I get an error that the namespace "kube-system" is managed and the request's verb "create" is denied. – GRS Apr 26 '23 at 16:49
  • @GRS The driver version is controlled by Google in Autopilot GKE clusters (I have updated my answer). If you want to know the driver version your workload is using, you can run the `kubectl describe` command and look for gpu_driver in the output, which shows the driver name and version. Otherwise, you can log in to your node or pods and find the driver version in the respective driver libraries. – Kranthiveer Dontineni Apr 29 '23 at 12:33
  • v470.161.03 (default), v525.60.13: what does this mean specifically? How would you force an Autopilot GKE node to have the v525 driver installed as opposed to 470? – Saccarab May 08 '23 at 08:41
  • @Saccarab At this point I only know of a workaround: after the node is created, you can go to the library path and upgrade the driver version, or you can use a custom image with the required driver version installed, because in an Autopilot GKE cluster everything is managed and we can't select the driver version. If you want control over your GPU and driver versions, it's easier to go with a Standard cluster. – Kranthiveer Dontineni May 08 '23 at 12:12
  • aha, thanks for the info. I guess I'll do standard till they support 525 on autopilot – Saccarab May 08 '23 at 14:06