Often, I will create a Dockerfile to install a single piece of software with something like `pip3 install apache-airflow`.
When this software is complex, it will depend on a few dozen Python packages. For example, the above line prompts pip to collect ~20 dependent packages. Each of those packages has its own dependencies in turn, so in the end I can end up with 100 or more Python packages to install. That's fine; if the developers require those dependencies, there is nothing I can do about it.
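For concreteness, a minimal sketch of the kind of Dockerfile I mean (the base image tag is just an example, not what I necessarily use):

```dockerfile
# Minimal sketch; the base image is only an illustration
FROM python:3.11-slim

# A single "install one program" step that pulls in ~100 transitive Python packages
RUN pip3 install apache-airflow
```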
However, occasionally this process breaks because some prerequisite Linux program, such as `gcc`, was not installed before the call to `pip`. On minimal distros with non-standard packages, like Alpine, this gets even worse. I will get situations like:
- Request installing a single program (e.g. airflow)
- Pip comes up with a list of ~100 prerequisite Python packages
- Downloads all of them
- Starts installing them one by one
- Package 71/100 gives a compiler error and everything fails
Then I must go back, try to add some Alpine packages that will take care of the compiler error, and try again (to see whether package 74 will now fail due to a different compiler error). However, each such trial takes a very long time, because `docker build` will re-download all the prerequisites and re-install the first ~70 packages that don't give errors. The Docker build cache is also of no help: technically, this whole lengthy process is one command, so it creates only one cached layer, and only if it actually succeeds. Also, adding a `RUN apk add` before the `pip3 install` invalidates the cache and causes the whole thing to be built again from scratch anyway.
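To illustrate the loop, here is roughly what each trial looks like; the apk package list is only a guess at what a given compiler error might need, not a known-good set for airflow:

```dockerfile
FROM python:3.11-alpine

# Changing this line (e.g. appending one more package after the next failure)
# invalidates the cache for every layer below it...
RUN apk add --no-cache gcc musl-dev libffi-dev

# ...so this single, long-running step starts over from zero on every attempt.
RUN pip3 install apache-airflow
```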
Given that the problem is with, say, package #71, it makes no sense to waste many minutes re-downloading all 100 packages and re-installing the first 70 while I am troubleshooting my Dockerfile. How can I use the cache more effectively in such a situation?
My current "solution" is to manually go through the pip install
output, compile a list of all the dependencies it gets, and then transform those into individual RUN pip3 install PACKAGE_NAME
commands in my Dockerfile. This way each package install attempt is cached separately. However, this does not seem like a good use of Dockerfile syntax. Is there a better way?
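For reference, the workaround in Dockerfile form looks roughly like this; the package names shown are just placeholders for the ~100 entries copied from the pip output, not a real or complete list:

```dockerfile
FROM python:3.11-alpine

RUN apk add --no-cache gcc musl-dev libffi-dev

# One RUN per dependency, so each successful install becomes its own cached layer
RUN pip3 install alembic
RUN pip3 install croniter
RUN pip3 install sqlalchemy
# ... ~100 more lines like this ...

RUN pip3 install apache-airflow
```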