Often, I will create a Dockerfile to install a single piece of software with something like `pip3 install apache-airflow`.
When this software is complex, it will depend on a few dozen Python packages. For example, the above line prompts pip to collect ~20 dependent packages. Each of those packages has its own dependencies in turn, so in the end I can end up with 100 or more Python packages to install. That's fine; if the developers require those dependencies, there is nothing I can do about it.
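For concreteness, a minimal sketch of the kind of Dockerfile I mean (the base image tag is just an example, not what I necessarily use):

```dockerfile
# Minimal sketch; the base image is only an illustration
FROM python:3.11-slim

# A single "install one program" step that pulls in ~100 transitive Python packages
RUN pip3 install apache-airflow
```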
However, occasionally this process breaks because some prerequisite Linux program, such as `gcc`, was not installed before the call to `pip`. On minimal distros with non-standard packages, like Alpine, this gets even worse. I will get situations like:
- Request installing a single program (e.g. airflow)
- Pip comes up with a list of ~100 prerequisite Python packages
- Downloads all of them
- Starts installing them one by one
- Package 71/100 gives a compiler error and everything fails
Then I must go back, try to add some Alpine packages that will take care of the compiler error, and try again (to see whether package 74 will now fail due to a different compiler error). However, each such trial takes a very long time, because `docker build` will re-download all the prerequisites and re-install the first ~70 packages that don't give errors. The Docker build cache is also of no help: technically, this whole lengthy process is one command, so it creates only one cached layer, and only if it actually succeeds. Also, adding a `RUN apk add` before the `pip3 install` invalidates the cache and causes the whole thing to be built again from scratch anyway.
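To illustrate the loop, here is roughly what each trial looks like; the apk package list is only a guess at what a given compiler error might need, not a known-good set for airflow:

```dockerfile
FROM python:3.11-alpine

# Changing this line (e.g. appending one more package after the next failure)
# invalidates the cache for every layer below it...
RUN apk add --no-cache gcc musl-dev libffi-dev

# ...so this single, long-running step starts over from zero on every attempt.
RUN pip3 install apache-airflow
```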
Given that the problem is with, say, package #71, it makes no sense to waste many minutes re-downloading all 100 packages and re-installing the first 70 while I am troubleshooting my Dockerfile. How can I use the cache more effectively in such a situation?
My current "solution" is to manually go through the pip install
output, compile a list of all the dependencies it gets, and then transform those into individual RUN pip3 install PACKAGE_NAME
commands in my Dockerfile. This way each package install attempt is cached separately. However, this does not seem like a good use of Dockerfile syntax. Is there a better way?
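For reference, the workaround in Dockerfile form looks roughly like this; the package names shown are just placeholders for the ~100 entries copied from the pip output, not a real or complete list:

```dockerfile
FROM python:3.11-alpine

RUN apk add --no-cache gcc musl-dev libffi-dev

# One RUN per dependency, so each successful install becomes its own cached layer
RUN pip3 install alembic
RUN pip3 install croniter
RUN pip3 install sqlalchemy
# ... ~100 more lines like this ...

RUN pip3 install apache-airflow
```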