
I'm trying to use the GPU from inside my Docker container. I'm using Docker 19.03 on Ubuntu 18.04.

Outside the Docker container, if I run `nvidia-smi` I get the output below:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   30C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

If I run the same command inside a container created from the `nvidia/cuda` Docker image, I get the same output as above and everything runs smoothly; `torch.cuda.is_available()` returns True.

But if I run the same `nvidia-smi` command inside any other Docker container, it gives the following output, where you can see that the CUDA Version is reported as N/A. Inside those containers `torch.cuda.is_available()` also returns False.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   30C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I installed nvidia-container-toolkit using the following commands:

curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/ubuntu18.04/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install nvidia-container-toolkit
sudo systemctl restart docker

I started my containers using the following commands:

sudo docker run --rm --gpus all nvidia/cuda nvidia-smi
sudo docker run -it --rm --gpus all ubuntu nvidia-smi
edited by talonmies
asked by Sai Chander

2 Answers


For anybody arriving here looking for how to do this with Docker Compose, add the following to your service:

deploy:
  resources:
    reservations:
      devices:
      - driver: nvidia
        capabilities:
          - gpu
          - utility # nvidia-smi
          - compute # CUDA. Required to avoid "CUDA version: N/A"
          - video   # NVDEC/NVENC. For instance to use a hardware accelerated ffmpeg. Skip it if you don't need it

Doc: https://docs.docker.com/compose/gpu-support
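Equivalently, with plain `docker run`, the same capabilities can be requested via the `NVIDIA_DRIVER_CAPABILITIES` environment variable, which nvidia-container-toolkit reads (a sketch; the `ubuntu` image here is only an example):

```shell
# Ask the NVIDIA runtime to inject both the compute (CUDA) and
# utility (nvidia-smi) driver libraries into the container.
docker run --rm --gpus all \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  ubuntu nvidia-smi
```

Images such as `nvidia/cuda` already set this variable in their Dockerfile, which is why `nvidia-smi` reports a CUDA version there but shows `N/A` in a plain `ubuntu` container.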

GG.

`docker run --rm --gpus all nvidia/cuda nvidia-smi` should NOT return `CUDA Version: N/A` if everything (i.e. the NVIDIA driver, the CUDA toolkit, and nvidia-container-toolkit) is installed correctly on the host machine.

Given that `docker run --rm --gpus all nvidia/cuda nvidia-smi` returns the correct output for you: I also had the `CUDA Version: N/A` problem inside the container, and I had luck solving it.

Please see my answer https://stackoverflow.com/a/64422438/2202107 (obviously you need to adjust and install the matching/correct versions of everything)
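One quick way to check whether the driver's CUDA library was actually injected into a container (a diagnostic sketch, not part of the linked answer):

```shell
# Inside the container: libcuda.so comes from the host driver and is
# mounted in by nvidia-container-toolkit. If it is missing, nvidia-smi
# reports "CUDA Version: N/A" and torch.cuda.is_available() is False.
ldconfig -p | grep libcuda
```

If nothing shows up, the container was most likely started without the `compute` capability (or without `--gpus` at all).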

Sida Zhou
  • Once the `CUDA Version: N/A` problem was solved, TensorFlow (GPU) just worked immediately without any configuration. I assume PyTorch will do the same. – Sida Zhou Oct 19 '20 at 07:06
  • Thank you so much for the answer; it seems like it would work for me as well. I've gone through your answer and saw that there are various other packages which need to be installed. In my case, I'm currently deploying a package on a server that is not connected to the internet, so I'll have to manually download the packages mentioned and their dependencies. I'll try to see if I can do it in the future. – Sai Chander Oct 20 '20 at 14:31
  • @SaiChander I have the same limitation; that's why I use Docker. I can install everything inside Docker on my local machine (where internet isn't restricted) and then upload the image to the server (where internet is restricted), either via internal Docker registries or via docker load/save. – Sida Zhou Oct 21 '20 at 10:05
  • I create a Docker image and then use `docker save` to create a tar file, move the tar file to the server where internet is not accessible, and use the `docker load` command to load the image from the tar file. Now, when I run the `docker run` command, will it still try to execute the `apt-get update && apt-get install...` command? Or can I somehow avoid it? – Sai Chander Oct 22 '20 at 09:01
  • You need to install nvidia-container-toolkit https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html – desertSniper87 Feb 24 '22 at 10:24
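The offline workflow discussed in the comments above can be sketched as follows (image and file names are placeholders):

```shell
# On the machine with internet access: build the image. All apt-get/pip
# RUN steps execute here, at build time, and are baked into image layers.
docker build -t myapp:latest .

# Serialize the image, including all its layers, to a tar file.
docker save -o myapp.tar myapp:latest

# Copy the tar file to the air-gapped server, then load it there.
# `docker run` on the loaded image does NOT re-execute the Dockerfile's
# RUN commands, so no internet access is needed at container start.
docker load -i myapp.tar
docker run --rm --gpus all myapp:latest nvidia-smi
```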