I'm deploying an application in a Docker container that requires CUDA 10. This is necessary to run some of the underlying PyTorch functionality that the application uses.

However, the host server is running Docker CE 17 and nvidia-docker v1.0 with CUDA 9, and I will not be able to upgrade the host.

I’m under the impression that I’m handcuffed to the v1 nvidia docker runtime and CUDA version available on the host.

Is there a way to run CUDA 10 in the container so I can use the functionality of this toolkit?

Berriel
JLuxton
  • It depends a lot on the specific configuration, especially which driver is loaded on the host, and whether or not the GPU in the host is a Tesla GPU. Under some circumstances, a docker container that depends on CUDA 10 can run on a CUDA 9 host, but it requires some specific steps. The requirements are all spelled out in [this document](https://docs.nvidia.com/deploy/cuda-compatibility/index.html). – Robert Crovella Jul 13 '19 at 04:41
  • Nice, I hadn’t come across these requirements before. So what if the host doesn’t have a Tesla GPU and has Quadros instead? I’m still unclear on the install steps I need to perform for the Dockerfile and/or compose file. – JLuxton Jul 13 '19 at 14:00
  • It's also important to know the GPU driver version installed on the host. If the host has a Quadro GPU, then the "backward" compatibility libraries won't work with that, and your only possibility for success is if by chance the host has a newer driver than a "typical" CUDA 9 driver installed. If your host has a "typical" CUDA 9 driver installed, and it is a Quadro GPU, there is no possibility to run a container that depends on CUDA 10. – Robert Crovella Jul 13 '19 at 14:03
  • Also, while it doesn't address your question directly, [this thread](https://devtalk.nvidia.com/default/topic/1056770/cuda-setup-and-installation/how-to-use-cuda-compatibility-package-to-use-a-newer-driver-on-an-older-kernel-module/) has some possibly useful info regarding the compatibility library usage and containers. But again, the compatibility library will not work with a Quadro GPU, guaranteed. – Robert Crovella Jul 13 '19 at 14:07
  • The relationship of the GPU driver required (e.g. what is installed on the base machine) and CUDA version supported (e.g. what CUDA version you can or wish to use in the container) is expressed in Table 1 [here](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html). Since you have a Quadro GPU in your base machine, the compatibility library mechanism can't be used, and your only hope of running a CUDA 10.0 container is if by chance a "newer" driver like 410.48 (or newer/higher) was installed on the base machine. – Robert Crovella Jul 13 '19 at 14:21
  • If your base machine has a 396.xx or lower driver and contains Quadro GPU(s), there is no possibility of running a CUDA 10.0 (or higher) container. – Robert Crovella Jul 13 '19 at 14:21
  • Thanks Robert - not the news that I was hoping for but very helpful. So it sounds like I need to upgrade the host machine or try to make my application compatible with CUDA 9 if I can’t upgrade the host... because there is no way to use the backward compatibility libraries for the Quadro... – JLuxton Jul 13 '19 at 14:30

1 Answer

In the general case, any specific CUDA version requires a minimum GPU driver version. NVIDIA documents this in places like the [CUDA compatibility document](https://docs.nvidia.com/deploy/cuda-compatibility/index.html) and Table 1 of the [CUDA Toolkit release notes](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html). So to use CUDA 9.0 you need at least a GPU driver version that supports CUDA 9.0, such as an R384 driver. To use CUDA 10.0 you need at least a GPU driver version that supports CUDA 10.0, such as an R410 driver.

The usage of containers doesn't fundamentally change this. If you want to use a container that has CUDA 10 code in it, your base machine needs a driver that supports CUDA 10.
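
As a rough illustration of the minimum-driver rule, the sketch below compares a host driver version against the minimum for a given CUDA version. The function names are hypothetical. The 410.48 figure for CUDA 10.0 is the one quoted in the comments on the question; the 384.81 and 418.39 figures are assumptions recalled from Table 1 of the release notes and should be checked there before relying on them.

```shell
# Minimum Linux driver per CUDA version.
# 410.48 is quoted in the thread; 384.81 and 418.39 are assumptions to verify
# against Table 1 of the CUDA Toolkit release notes.
min_driver_for_cuda() {
  case "$1" in
    9.0)  echo "384.81" ;;
    10.0) echo "410.48" ;;
    10.1) echo "418.39" ;;
    *)    return 1 ;;   # version not in this small table
  esac
}

# Return 0 if dotted version $1 >= $2, comparing each component numerically.
version_ge() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, x, "."); m = split(b, y, ".")
    k = (n > m) ? n : m
    for (i = 1; i <= k; i++) {
      xi = (i <= n) ? x[i] + 0 : 0
      yi = (i <= m) ? y[i] + 0 : 0
      if (xi > yi) exit 0
      if (xi < yi) exit 1
    }
    exit 0
  }'
}

# Return 0 if the driver ($1) is new enough for the CUDA version ($2),
# 1 if it is too old, 2 if the CUDA version is not in the table.
host_supports_cuda() {
  min=$(min_driver_for_cuda "$2") || return 2
  version_ge "$1" "$min"
}
```

On a real host the driver version could be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader` and passed as the first argument. For example, `host_supports_cuda 396.26 10.0` returns a non-zero status, which matches the situation described in the question.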

NVIDIA did start publishing compatibility libraries that relax the driver requirement above in certain cases. These libraries are available but not installed by default with a CUDA toolkit install, and they have specific requirements that must be met before they can be used. They are documented in NVIDIA's [CUDA compatibility document](https://docs.nvidia.com/deploy/cuda-compatibility/index.html).

One of the specific requirements for use of these compatibility libraries is that the GPU(s) in use must be Tesla-brand GPUs. GeForce, Quadro, Jetson, and Titan family GPUs are not supported by these compatibility libraries.

Furthermore, the libraries only work with certain combinations of CUDA toolkit versions and GPU driver versions installed on the base machine. This "compatibility matrix" is documented in Table 3 of NVIDIA's [CUDA compatibility document](https://docs.nvidia.com/deploy/cuda-compatibility/index.html). Only the combinations of toolkit version and installed driver version listed there are usable. To pick one example, if you wish to use CUDA 10.0 and your base machine has a Tesla GPU with an R396 driver installed, there is no compatibility support. In the same setup, however, if you wish to use CUDA 10.1, there is compatibility support for that.
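
The two requirements above (Tesla-brand GPU, supported toolkit/driver pair) could be sketched as a small lookup. This is only an illustration with hypothetical function names: the matrix contains just the two pairs named in this answer, and everything else is reported as `unknown` rather than guessed. It also assumes the GPU name string begins with the brand, as in `Tesla V100-PCIE-16GB` or `Quadro P4000`.

```shell
# Return 0 if the GPU name (e.g. from `nvidia-smi --query-gpu=name`) is Tesla-branded.
is_tesla_gpu() {
  case "$1" in
    Tesla*) return 0 ;;
    *)      return 1 ;;
  esac
}

# Print yes/no/unknown for a (driver branch, CUDA version) pair.
# Only the two pairs mentioned in the answer are filled in; extend this
# from Table 3 of the compatibility document.
compat_matrix_lookup() {
  case "$1:$2" in
    R396:10.1) echo yes ;;
    R396:10.0) echo no ;;
    *)         echo unknown ;;
  esac
}

# $1 = GPU name, $2 = driver branch, $3 = CUDA toolkit version
compat_path_usable() {
  if ! is_tesla_gpu "$1"; then
    echo no            # GeForce, Quadro, Jetson, and Titan are not supported
    return
  fi
  compat_matrix_lookup "$2" "$3"
}
```

For instance, `compat_path_usable "Quadro P4000" R396 10.1` prints `no` regardless of the driver, because the GPU family check fails first.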

If you have satisfied the requirements for compatibility usage, then the remaining step would be to install the compatibility libraries (or build your container from a base container that has the compatibility libraries already installed).

For a package manager CUDA install method, installing the compatibility libraries is simple (example on Ubuntu, installing the CUDA 10.1 compatibility package to match a CUDA 10.1 toolkit install):

    sudo apt-get install cuda-compat-10.1

Make sure to match the version to the CUDA toolkit version that you are using (that you installed with the package manager method, or that was already installed in your container).
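
One way to keep the package version aligned with the toolkit is to derive the package name from the toolkit version string. The sketch below is an assumption-laden illustration: the function names are hypothetical, the package naming follows the `cuda-compat-10.1` form used above, and it assumes a full version string such as `10.1.243` is available (for example from a `CUDA_VERSION` environment variable, which the official CUDA container images are believed to set).

```shell
# Derive the compat package name from a toolkit version string.
compat_package_for() {
  # Keep only major.minor: 10.1.243 -> 10.1
  major_minor=$(printf '%s\n' "$1" | cut -d. -f1,2)
  printf 'cuda-compat-%s\n' "$major_minor"
}

# Print (rather than run) the install command so it can be reviewed first.
print_compat_install() {
  printf 'apt-get install -y %s\n' "$(compat_package_for "$1")"
}
```

Calling `print_compat_install 10.1.243` prints `apt-get install -y cuda-compat-10.1`, which can then be run with `sudo` on the host or placed in a container build step.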

This compatibility "path" only began in the CUDA 9.0 timeframe. Systems that are equipped with drivers that predate CUDA 9.0 will not be usable in any way for this compatibility path. There are also various functional limitations and restrictions, which are covered in the documentation.

When this "compatibility path" is correctly installed and in use, the overall system configuration can "appear" to be violating the rules indicated at the top of this answer. For example a CUDA 10.1 application could possibly be running on a machine that had only a R396 driver installed.

For the specific question in view here, OP eventually indicated that the base machine had a Quadro GPU, so this "compatibility path" does not apply, and the only way to run e.g. a CUDA 10.0 container would be if a CUDA 10.0-capable driver is installed in the base machine, e.g. R410 or later driver.

Robert Crovella
  • What if the host machine has CUDA 10 and an R430 driver, but I want to run a container with CUDA 9 code? Somehow it gives me `Failed to initialize NVML: Driver/library version mismatch` – Raven Cheuk Jul 24 '20 at 17:46
  • Then you haven't properly installed or used the [nvidia container toolkit](https://github.com/NVIDIA/nvidia-docker). You should not be installing any driver components in the container (which is what you have done). You should let the container toolkit handle that. The driver gets installed on the host machine only. Take a look at the nvidia cuda containers (dockerfiles) to see how they are built. – Robert Crovella Jul 24 '20 at 17:52
  • So the `Failed to initialize NVML: Driver/library version mismatch` implies that I have installed a driver inside the container? Since I created this container a long time ago, I forgot what I did back then... But the CUDA 10 and R430 that I mentioned are both on the host, not inside the container. – Raven Cheuk Jul 24 '20 at 17:55
  • Yes, that is what it means. And you shouldn't do that when you are using the nvidia container toolkit. If you're not using the nvidia container toolkit, you must ensure that the driver installed on the host precisely matches the driver installed in the container. It's recommended not to bother with that, and instead use the nvidia container toolkit, which handles this for you. In that case **do not install any GPU driver or driver components in the container**. – Robert Crovella Jul 24 '20 at 17:58
  • Is there any way to make this container work again? Like removing the drivers from the container? – Raven Cheuk Jul 24 '20 at 18:00
  • Please ask a new question; I won't be able to sort it out in the comments. – Robert Crovella Jul 24 '20 at 18:01
  • Thank you for your prompt reply. Here's the link to my question https://stackoverflow.com/questions/63079329/old-docker-containers-are-not-usable-no-gpu-after-updating-the-gpu-driver-in-t – Raven Cheuk Jul 24 '20 at 18:17