4

I'm trying to set up a Container-Optimized OS (COS) VM on GCE with a GPU, following the instructions at https://cloud.google.com/container-optimized-os/docs/how-to/run-gpus. After creating the VM, the instructions say to ssh in and run `cos-extensions install gpu`. That works; during the install you can see it run `nvidia-smi`, which prints the driver version (440.33.01) and connects to the card.

But it installs the nvidia bins and libs in /var/lib/nvidia, which is mounted as noexec in this OS (it's very locked down). That means none of the libs or utilities work. And when you mount them to a docker container, they don't work there either; they're still noexec.

The only workaround I've found is to copy the whole /var/lib/nvidia dir to a tmpfs scratch disk and use it from there. Am I using it wrong, or is it just broken?

talonmies
GaryO

2 Answers

2

This looks less like a containerd issue and more like expected Container-Optimized OS behaviour: COS adds another level of hardening by providing security-minded default values for several features.

According to the documentation on the Container-Optimized OS filesystem, everything under /var is mounted noexec except for

  • /var/lib/google
  • /var/lib/docker
  • /var/lib/toolbox

Those are mounted as writable, executable, and stateful.
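
To check how these directories are actually mounted on a given VM, something like the following sketch may help. It assumes `findmnt` (from util-linux) is available on the image; the function names `mount_exec_state` and `path_exec_state` are made up for illustration.

```shell
#!/bin/sh
# Print "noexec" or "exec" for a comma-separated mount-option string,
# e.g. "rw,nosuid,nodev,noexec,relatime".
mount_exec_state() {
  case ",$1," in
    *,noexec,*) echo noexec ;;
    *)          echo exec ;;
  esac
}

# Report the state of the filesystem that contains the given path.
# Assumption: findmnt is installed; --target resolves a path to its mount.
path_exec_state() {
  opts=$(findmnt -n -o OPTIONS --target "$1") || return 1
  mount_exec_state "$opts"
}

# Example on a COS VM (paths taken from the question and the list above):
#   for d in /var/lib/nvidia /var/lib/docker /var/lib/google /var/lib/toolbox; do
#     printf '%s: %s\n' "$d" "$(path_exec_state "$d")"
#   done
```

If /var/lib/nvidia reports noexec while the three directories listed above report exec, that matches the behaviour described in the question.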

Ubuntu images with containerd, on the other hand, do not apply the same strict exec/noexec policy per mount that COS does, so using an Ubuntu-based image instead of COS could be a workaround.

Another option is to copy the contents of /var/lib/nvidia to another mount point that was not mounted with the noexec option, as you already did.
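
A minimal sketch of that copy workaround, assuming a fresh tmpfs mounted with the exec option. The destination path, the size, and the helper names `run` and `copy_to_exec_tmpfs` are hypothetical and not taken from the COS documentation; setting `DRY_RUN=1` prints the commands instead of running them.

```shell
#!/bin/sh
# Run a command, or just print it when DRY_RUN is set (for inspection).
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Copy a directory tree onto a newly mounted tmpfs that allows execution.
#   $1 = source directory (e.g. /var/lib/nvidia)
#   $2 = destination mount point (hypothetical; must be in a writable location)
#   $3 = tmpfs size (assumed default; it must be large enough for the driver tree)
copy_to_exec_tmpfs() {
  src=$1
  dst=$2
  size=${3:-1g}
  run sudo mkdir -p "$dst" &&
    run sudo mount -t tmpfs -o "exec,size=$size" tmpfs "$dst" &&
    run sudo cp -a "$src/." "$dst/"
}

# Example (destination path is an assumption):
#   copy_to_exec_tmpfs /var/lib/nvidia /mnt/disks/nvidia-exec 2g
```

A tmpfs lives in memory and is emptied on reboot, so the copy would have to be repeated after each boot, and any volumes mounted into a container would need to point at the copy rather than at /var/lib/nvidia.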

Jose Luis Delgadillo
  • Trying to understand the solution, this is about the 6th thread I've read on it and I'm struggling a bit, so we should either copy `/var/lib/nvidia` to another mount point (not sure what this means), or create a new VM with Ubuntu (wasn't an option on creation for some reason) ? – Kevin Danikowski Aug 24 '21 at 15:58
  • Three years later, still doesn't work. Moreover these days it doesn't even install - exits with `no space left on device` error. Shame. – jayarjo May 07 '23 at 19:31
1

Turns out I wasn't doing anything wrong. This is confirmed now as a bug in cos-extensions: https://issuetracker.google.com/issues/164134488

Odd, because it seems like this would have shown up in testing.

There aren't any good production workarounds at the moment, because as a user it's hard to modify COS's behavior without some advanced scripting.

GaryO
  • You are right, you weren't doing anything wrong. It seems this is not the only error with COS images: another post, https://issuetracker.google.com/159702288, reports that cos_containerd images mount application volumes with the noexec option. As I mentioned, using Ubuntu could be an option, as they did in that post. – Jose Luis Delgadillo Aug 14 '20 at 14:51
  • There's no option to use Ubuntu. Can you link the source of this information? – jayarjo May 07 '23 at 19:31