I'm building a container image that includes the CUDA toolkit, but the toolkit is ~8 GB. It's extremely cumbersome because Docker won't cache this step, so every build redownloads all of the files and then reinstalls them. Is there a way to force Docker's build cache to be larger?
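From what I can tell, dockerd's embedded BuildKit prunes its cache against a storage budget that can be raised in /etc/docker/daemon.json (followed by a daemon restart). A sketch of what I mean; I build through podman/buildx, so I'm not sure this even applies to my setup:
{
  "builder": {
    "gc": {
      "enabled": true,
      "defaultKeepStorage": "20GB"
    }
  }
}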
I also tried RUN --mount=type=cache,target=/var/cache/cuda microdnf install -y cuda
but this did not change anything. My understanding from the docs is that this should have forced it to reuse the cached files, but Docker seems to have ignored it, presumably because of the intrinsic cache size limit.
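One thing I'm not sure about: I picked /var/cache/cuda arbitrarily, and a cache mount only persists whatever the command actually writes into the target directory, so pointing it at the package manager's real cache location might behave differently. A sketch, assuming microdnf keeps its download cache under the dnf-style path /var/cache/dnf (which may differ) and that keepcache is enabled:
# Sketch: share the package manager's download cache across builds.
# Assumes microdnf writes under /var/cache/dnf (verify; some images use
# /var/cache/yum) and that keepcache retains the downloaded RPMs.
RUN --mount=type=cache,target=/var/cache/dnf \
    microdnf install -y cuda
Even then, as I understand it, an invalidated RUN still re-executes; the mount would only spare the downloads, not the install itself.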
Update
Using a builder stage does not work. This Dockerfile:
ARG USER_ID=1000
ARG GROUP_ID=1000
ARG USERNAME=simulateqcd
ARG GROUPNAME=simulateqcd
ARG CMD
# Build the CUDA toolkit. We do this because the size of the CUDA toolkit
# exceeds that of the Docker build cache. This means if you want to make any
# changes to the CUDA toolkit, you need to rebuild the entire image including
# redownloading the entire 8 GB. Using it as a builder stage bypasses this problem.
FROM rockylinux:9-minimal as cuda-builder
# Install necessary dependencies
RUN microdnf update -y
RUN microdnf install -y gcc-c++
RUN microdnf install -y kernel-devel
# Add NVIDIA repository and install the CUDA Toolkit
RUN curl -sSL https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo > /etc/yum.repos.d/cuda.repo
RUN mkdir -p /var/cache/cuda
RUN microdnf install -y cuda
# Use an official Rocky Linux 9 image
FROM rockylinux:9-minimal
RUN microdnf update -y
RUN microdnf install -y cmake
RUN microdnf install -y gcc-c++
RUN microdnf install -y openmpi
RUN microdnf install -y openmpi-devel
RUN microdnf install -y kernel-devel
ARG USER_ID
ARG GROUP_ID
ARG USERNAME
ARG GROUPNAME
ARG CMD
# This code is just ensuring that our user exists and is running with the same permissions as the host user.
# This is usually userid/gid 1000
RUN (getent group ${GROUP_ID} && (echo groupdel by-id ${GROUP_ID}; groupdel $(getent group ${GROUP_ID} | cut -d: -f1))) ||:
RUN (getent group ${GROUPNAME} && (echo groupdel ${GROUPNAME}; groupdel ${GROUPNAME})) ||:
RUN (getent passwd ${USERNAME} && (echo userdel ${USERNAME}; userdel -f ${USERNAME})) ||:
RUN groupadd -g ${GROUP_ID} ${GROUPNAME}
RUN useradd -l -u ${USER_ID} -g ${GROUPNAME} ${USERNAME}
# Set environment variables for CUDA
# TODO: This probably needs to be permanent
ENV PATH=/usr/lib64/openmpi/bin:$PATH
ENV PATH=/usr/local/cuda-12.1/bin:$PATH
ENV LD_LIBRARY_PATH="/usr/local/cuda-12.1/lib64:${LD_LIBRARY_PATH}"
# Set the environment variables in the user's shell profile
RUN echo 'export PATH="/usr/lib64/openmpi/bin:$PATH"' >> /home/${USERNAME}/.profile
RUN echo 'export PATH="/usr/local/cuda-12.1/bin/nvcc:${PATH}"' >> /home/${USERNAME}/.profile
RUN echo 'export LD_LIBRARY_PATH="/usr/local/cuda-12.1/lib64:${LD_LIBRARY_PATH}"' >> /home/${USERNAME}/.profile
# Create simulateqcd directory
RUN mkdir /simulateqcd
RUN mkdir /build
# Copy source code into the container
COPY ../src /simulateqcd
COPY ../CMakeLists.txt /simulateqcd
COPY ../parameter /simulateqcd
COPY ../scripts /simulateqcd
COPY ../test_conf /simulateqcd
# Set the working directory to /build
WORKDIR /build
# Copy CUDA from the CUDA builder. Keep in mind that due to the size of these
# files there is a large chance that everything after this line will rerun
# after each build.
COPY --from=cuda-builder /usr/local/cuda-12.1 /usr/local/cuda-12.1
# Test CUDA installation
RUN nvcc --version
# Build code using cmake
# TODO - Need to parameterize these options
RUN cmake ../simulateqcd/ -DARCHITECTURE="70" -DUSE_GPU_AWARE_MPI=ON -DUSE_GPU_P2P=ON -DMPI_CXX_LIBRARIES=/usr/lib64/openmpi/lib/libmpi_cxx.so -DMPI_CXX_HEADER_DIR=/usr/include/openmpi-x86_64 -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.1/bin/nvcc
RUN make -j 4
# Set the user for the build. Note: Keep in mind that during the build, the user
# runs with root privileges.
USER ${USERNAME}
yielded:
[grant@rockylinux podman-build]$ ./simulate_qcd.sh setup --build
disabled
disabled
/usr/bin/podman
/usr/local/bin/docker-compose
Running PORT=9000 docker-compose --project-directory /opt/simulateqcd -f /opt/simulateqcd/podman-build/docker-compose.yml --profile core up --build --remove-orphans -d
#1 [internal] load build definition from Dockerfile
#1 transferring dockerfile: 3.41kB done
#1 DONE 0.0s
#2 [internal] load .dockerignore
#2 transferring context: 2B done
#2 DONE 0.0s
#3 [internal] load metadata for docker.io/library/rockylinux:9-minimal
#3 DONE 10.2s
#4 [cuda-builder 1/7] FROM docker.io/library/rockylinux:9-minimal@sha256:9da6a8917c8cc0d429eb08ed1a4ba8588ba81807664853cd86c642a75ca91cfe
#4 resolve docker.io/library/rockylinux:9-minimal@sha256:9da6a8917c8cc0d429eb08ed1a4ba8588ba81807664853cd86c642a75ca91cfe done
#4 DONE 0.0s
#5 [internal] load build context
#5 transferring context: 88B
#5 transferring context: 98.65MB 1.2s done
#5 DONE 1.2s
#6 [cuda-builder 3/7] RUN microdnf install -y gcc-c++
#6 CACHED
#7 [cuda-builder 4/7] RUN microdnf install -y kernel-devel
#7 CACHED
#8 [cuda-builder 5/7] RUN curl -sSL https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo > /etc/yum.repos.d/cuda.repo
#8 CACHED
#9 [cuda-builder 2/7] RUN microdnf update -y
#9 CACHED
#10 [cuda-builder 6/7] RUN mkdir -p /var/cache/cuda
#10 CACHED
#11 [cuda-builder 7/7] RUN microdnf install -y cuda
#11 3.336 Downloading metadata...
CUDA is still rebuilt on every run even though I used a builder stage.
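In case it's relevant, the BuildKit cache contents can be inspected and reset directly with buildx (my wrapper script goes through podman and docker-compose, so the exact invocation may differ here):
# List what is in the BuildKit cache and how large each record is
docker buildx du --verbose
# Wipe the build cache to retry from a known-clean state
docker buildx prune --all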
Update 2
It partially works, but it matters where you put the COPY. This is my updated Dockerfile:
ARG USER_ID=1000
ARG GROUP_ID=1000
ARG USERNAME=simulateqcd
ARG GROUPNAME=simulateqcd
ARG CMD
# Build the CUDA toolkit. We do this because the size of the CUDA toolkit
# exceeds that of the Docker build cache. This means if you want to make any
# changes to the CUDA toolkit, you need to rebuild the entire image including
# redownloading the entire 8 GB. Using it as a builder stage bypasses this problem.
FROM rockylinux:9-minimal as cuda-builder
# Add NVIDIA repository and install the CUDA Toolkit
RUN curl -sSL https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo > /etc/yum.repos.d/cuda.repo
RUN mkdir -p /var/cache/cuda
RUN microdnf install -y cuda
# Use an official Rocky Linux 9 image
FROM rockylinux:9-minimal
# Copy CUDA from the CUDA builder. Keep in mind that due to the size of these
# files there is a large chance that everything after this line will rerun
# after each build.
COPY --from=cuda-builder /usr/local/cuda-12.1 /usr/local/cuda-12.1
RUN microdnf update -y
RUN microdnf install -y cmake
RUN microdnf install -y gcc-c++
RUN microdnf install -y openmpi
RUN microdnf install -y openmpi-devel
RUN microdnf install -y kernel-devel
ARG USER_ID
ARG GROUP_ID
ARG USERNAME
ARG GROUPNAME
ARG CMD
# This code is just ensuring that our user exists and is running with the same permissions as the host user.
# This is usually userid/gid 1000
RUN (getent group ${GROUP_ID} && (echo groupdel by-id ${GROUP_ID}; groupdel $(getent group ${GROUP_ID} | cut -d: -f1))) ||:
RUN (getent group ${GROUPNAME} && (echo groupdel ${GROUPNAME}; groupdel ${GROUPNAME})) ||:
RUN (getent passwd ${USERNAME} && (echo userdel ${USERNAME}; userdel -f ${USERNAME})) ||:
RUN groupadd -g ${GROUP_ID} ${GROUPNAME}
RUN useradd -l -u ${USER_ID} -g ${GROUPNAME} ${USERNAME}
# Set environment variables for CUDA
# TODO: This probably needs to be permanent
ENV PATH=/usr/lib64/openmpi/bin:$PATH
ENV PATH=/usr/local/cuda-12.1/bin:$PATH
ENV LD_LIBRARY_PATH="/usr/local/cuda-12.1/lib64:${LD_LIBRARY_PATH}"
# Set the environment variables in the user's shell profile
RUN echo 'export PATH="/usr/lib64/openmpi/bin:$PATH"' >> /home/${USERNAME}/.profile
RUN echo 'export PATH="/usr/local/cuda-12.1/bin/nvcc:${PATH}"' >> /home/${USERNAME}/.profile
RUN echo 'export LD_LIBRARY_PATH="/usr/local/cuda-12.1/lib64:${LD_LIBRARY_PATH}"' >> /home/${USERNAME}/.profile
# Create simulateqcd directory
RUN mkdir /simulateqcd
RUN mkdir /build
# Copy source code into the container
COPY ../src /simulateqcd/src
COPY ../CMakeLists.txt /simulateqcd/CMakeLists.txt
COPY ../parameter /simulateqcd/parameter
COPY ../scripts /simulateqcd/scripts
COPY ../test_conf /simulateqcd/test_conf
# Set the working directory to /build
WORKDIR /build
# Test CUDA installation
RUN nvcc --version
# Build code using cmake
# TODO - Need to parameterize these options
RUN cmake ../simulateqcd/ -DARCHITECTURE="70" -DUSE_GPU_AWARE_MPI=ON -DUSE_GPU_P2P=ON -DMPI_CXX_LIBRARIES=/usr/lib64/openmpi/lib/libmpi_cxx.so -DMPI_CXX_HEADER_DIR=/usr/include/openmpi-x86_64 -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.1/bin/nvcc
RUN make -j
# Set the user for the build. Note: Keep in mind that during the build, the user
# runs with root privileges.
USER ${USERNAME}
I moved the COPY to the beginning of the second stage. The problem with this approach is that the CUDA layer is now kept, but, seemingly because of the size of the COPY, the build still reruns... parts of the second stage? Now only some of the package installs are cached.
[grant@rockylinux podman-build]$ ./simulate_qcd.sh setup --build
disabled
disabled
/usr/bin/podman
/usr/local/bin/docker-compose
Running PORT=9000 docker-compose --project-directory /opt/simulateqcd -f /opt/simulateqcd/podman-build/docker-compose.yml --profile core up --build --remove-orphans -d
#1 [internal] booting buildkit
#1 starting container buildx_buildkit_default
#1 starting container buildx_buildkit_default 0.5s done
#1 DONE 0.5s
#2 [internal] load build definition from Dockerfile
#2 transferring dockerfile: 3.33kB done
#2 DONE 0.0s
#3 [internal] load .dockerignore
#3 transferring context: 2B done
#3 DONE 0.0s
#4 [internal] load metadata for docker.io/library/rockylinux:9-minimal
#4 DONE 15.5s
#5 [cuda-builder 1/4] FROM docker.io/library/rockylinux:9-minimal@sha256:9da6a8917c8cc0d429eb08ed1a4ba8588ba81807664853cd86c642a75ca91cfe
#5 resolve docker.io/library/rockylinux:9-minimal@sha256:9da6a8917c8cc0d429eb08ed1a4ba8588ba81807664853cd86c642a75ca91cfe done
#5 DONE 0.0s
#6 [cuda-builder 2/4] RUN curl -sSL https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo > /etc/yum.repos.d/cuda.repo
#6 CACHED
#7 [cuda-builder 3/4] RUN mkdir -p /var/cache/cuda
#7 CACHED
#8 [stage-1 3/27] RUN microdnf update -y
#8 CACHED
#9 [stage-1 4/27] RUN microdnf install -y cmake
#9 CACHED
#10 [cuda-builder 4/4] RUN microdnf install -y cuda
#10 CACHED
#11 [stage-1 2/27] COPY --from=cuda-builder /usr/local/cuda-12.1 /usr/local/cuda-12.1
#11 CACHED
#12 [stage-1 5/27] RUN microdnf install -y gcc-c++
#12 CACHED
#13 [internal] load build context
#13 transferring context: 5.56MB 0.1s
#13 transferring context: 98.65MB 1.4s done
#13 DONE 1.4s
#14 [stage-1 6/27] RUN microdnf install -y openmpi
I have no idea why Docker has now decided it will cache the gcc-c++ install but not the openmpi install. I'm not sure what I'm looking at, but it seems as if Docker's cache system is highly unpredictable. I also do not understand why placing the COPY later in the second stage, as in my first Dockerfile above, triggered a reinstall of CUDA, and I am generally struggling to understand why Docker does or does not decide to cache certain steps.
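For reference, the next ordering I plan to try is: one consolidated package layer first, the large COPY --from after it, and the frequently-edited source COPY last, on the theory that a change can only invalidate its own layer and everything after it. A sketch (untested; cuda-builder is the stage defined above):
# Stable, rarely-changing layers first, in a single consolidated step
FROM rockylinux:9-minimal
RUN microdnf update -y && \
    microdnf install -y cmake gcc-c++ openmpi openmpi-devel kernel-devel
# Large but rarely-changing CUDA payload from the builder stage
COPY --from=cuda-builder /usr/local/cuda-12.1 /usr/local/cuda-12.1
# Frequently-edited source last, so code changes only rerun from here down
COPY ../src /simulateqcd/src
COPY ../CMakeLists.txt /simulateqcd/CMakeLists.txt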