
I built a Docker image based on nvidia/cuda:11.3.1-cudnn8-runtime-ubuntu20.04. My Dockerfile looks like this:

ARG CUDA_VERSION=11.3.1
FROM nvidia/cuda:${CUDA_VERSION}-cudnn8-runtime-ubuntu20.04
ARG PYTORCH_VERSION=1.12.1

# Set a docker label to enable container to use SAGEMAKER_BIND_TO_PORT environment variable if present
LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true
LABEL maintainer="Change Healthcare"
LABEL dlc_major_version="1"

ENV PATH /opt/conda/bin:$PATH

RUN rm /etc/apt/sources.list.d/*
RUN apt-get update
RUN apt-get install -y curl wget

RUN curl -L -o ~/miniconda.sh https://repo.continuum.io/miniconda/Miniconda3-py38_23.1.0-1-Linux-x86_64.sh
RUN chmod +x ~/miniconda.sh
RUN ~/miniconda.sh -b -p /opt/conda
RUN rm ~/miniconda.sh
RUN /opt/conda/bin/conda install -y ruamel_yaml==0.15.100 cython botocore mkl-include mkl
RUN /opt/conda/bin/conda clean -ya

RUN pip install --upgrade pip --trusted-host pypi.org --trusted-host files.pythonhosted.org
RUN ln -s /opt/conda/bin/pip /usr/local/bin/pip
RUN ln -s /opt/conda/bin/pip /usr/local/bin/pip3
RUN ln -s /opt/conda/bin/python /usr/local/bin/python
RUN pip install packaging==20.4 enum-compat==0.0.3

# Conda installs links for libtinfo.so.6 and libtinfo.so.6.2 both
# Which causes "/opt/conda/lib/libtinfo.so.6: no version information available" warning
# Removing link for libtinfo.so.6. This change is needed only for ubuntu 20.04-conda, and can be reverted
# once conda fixes the issue: https://github.com/conda/conda/issues/9680
RUN rm -rf /opt/conda/lib/libtinfo.so.6

WORKDIR /

RUN cd tmp/ \
 && rm -rf tmp*

# Uninstall any pre-installed torch and re-install PyTorch from the pytorch conda channel
RUN pip uninstall -y torch
RUN /opt/conda/bin/conda install -y pytorch==${PYTORCH_VERSION} cudatoolkit=11.3 -c pytorch

When I start a container from this image and run the following in Python:

import torch
torch.cuda.is_available()

it returns False.

If I instead build the image from nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04, the same check

import torch
torch.cuda.is_available()

returns True.

But the devel image is much larger than the runtime image, and I want to keep runtime as the base. Can anyone help me figure out how to make PyTorch find the GPU with nvidia/cuda:11.3.1-cudnn8-runtime-ubuntu20.04 as the base image?

Regards, Arthur


1 Answer


You need to run the container with the NVIDIA Container Toolkit (formerly known as nvidia-docker) in order to expose the GPU.

If you're doing this on AWS, you might want to check their documentation, as they still use nvidia-docker. It may be as simple as using nvidia-docker run instead of docker run; as far as I recall, it's generally baked into their Deep Learning AMIs.
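A minimal sketch of both invocations (the image name my-pytorch-image is a placeholder for whatever tag you gave your build):

```shell
# With the NVIDIA Container Toolkit installed on the host,
# the --gpus flag exposes the host GPU(s) to the container.
docker run --rm --gpus all my-pytorch-image \
    python -c "import torch; print(torch.cuda.is_available())"

# On hosts still using the legacy nvidia-docker wrapper
# (e.g. some AWS Deep Learning AMIs), the equivalent is:
nvidia-docker run --rm my-pytorch-image \
    python -c "import torch; print(torch.cuda.is_available())"
```

If the toolkit is set up correctly on the host, both should print True even with the runtime base image; without it, the container never sees the GPU regardless of which base image you use.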

stuart
  • 1,005
  • 1
  • 10
  • 18