What causes a cache invalidation when building a Dockerfile?

Question

I've been reading the docs page Best practices for writing Dockerfiles. I came across a small inaccuracy (IMHO) whose meaning only became clear after reading further:

Using apt-get update alone in a RUN statement causes caching issues and subsequent apt-get install instructions fail.

Why would they fail, I wondered. An explanation of what they meant by "fail" came later:

Because the apt-get update is not run, your build can potentially get an outdated version of the curl and nginx packages.

However, for the following I still cannot understand what they mean by "If not, the cache is invalidated.":

Starting with a parent image that is already in the cache, the next instruction is compared against all child images derived from that base image to see if one of them was built using the exact same instruction. If not, the cache is invalidated.

That part is quoted in some answers on SO, e.g. How does Docker know when to use the cache during a build and when not?, and the concept of cache invalidation as a whole is clear to me. I have also read:

When does Docker image cache invalidation occur? Which algorithm Docker uses for invalidate cache?

But what is the meaning of "if not"? At first I was sure the phrase meant "if no such image is found". That would be overkill: invalidating a cache that may be useful later for other builds. And indeed, the cache is not invalidated when no matching image is found, as I saw when I tried the following:

$ docker build -t alpine:test1 - <<HITTT
> FROM apline
> RUN echo "test1"
> RUN echo "test1-2"
> HITTT
Sending build context to Docker daemon  3.072kB
Step 1/3 : FROM apline
pull access denied for apline, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
(base) nb0408:docker a.martianov$ docker build -t alpine:test1 - <<HITTT
> FROM alpine
> RUN echo "test1"
> RUN echo "test1-2"
> HITTT
Sending build context to Docker daemon  3.072kB
Step 1/3 : FROM alpine
 ---> 965ea09ff2eb
Step 2/3 : RUN echo "test1"
 ---> Running in 928453d33c7c
test1
Removing intermediate container 928453d33c7c
 ---> 0e93df31058d
Step 3/3 : RUN echo "test1-2"
 ---> Running in b068bbaf8a75
test1-2
Removing intermediate container b068bbaf8a75
 ---> daeaef910f21
Successfully built daeaef910f21
Successfully tagged alpine:test1

$ docker build -t alpine:test1-1 - <<HITTT
> FROM alpine
> RUN echo "test1"
> RUN echo "test1-3"
> HITTT
Sending build context to Docker daemon  3.072kB
Step 1/3 : FROM alpine
 ---> 965ea09ff2eb
Step 2/3 : RUN echo "test1"
 ---> Using cache
 ---> 0e93df31058d
Step 3/3 : RUN echo "test1-3"
 ---> Running in 74aa60a78ae1
test1-3
Removing intermediate container 74aa60a78ae1
 ---> 266bcc6933a8
Successfully built 266bcc6933a8
Successfully tagged alpine:test1-1

$ docker build -t alpine:test1-2 - <<HITTT
> FROM alpine
> RUN "test2"
> RUN 
(base) nb0408:docker a.martianov$ docker build -t alpine:test2 - <<HITTT
> FROM alpine
> RUN echo "test2"
> RUN echo "test1-3"
> HITTT
Sending build context to Docker daemon  3.072kB
Step 1/3 : FROM alpine
 ---> 965ea09ff2eb
Step 2/3 : RUN echo "test2"
 ---> Running in 1a058ddf901c
test2
Removing intermediate container 1a058ddf901c
 ---> cdc31ac27a45
Step 3/3 : RUN echo "test1-3"
 ---> Running in 96ddd5b0f3bf
test1-3
Removing intermediate container 96ddd5b0f3bf
 ---> 7d8b901f3939
Successfully built 7d8b901f3939
Successfully tagged alpine:test2

$ docker build -t alpine:test1-3 - <<HITTT
> FROM alpine
> RUN echo "test1"
> RUN echo "test1-3"
> HITTT
Sending build context to Docker daemon  3.072kB
Step 1/3 : FROM alpine
 ---> 965ea09ff2eb
Step 2/3 : RUN echo "test1"
 ---> Using cache
 ---> 0e93df31058d
Step 3/3 : RUN echo "test1-3"
 ---> Using cache
 ---> 266bcc6933a8
Successfully built 266bcc6933a8
Successfully tagged alpine:test1-3

The cache was used again for the last build. What do the docs mean by "if not"?

This phrase "if not" refers to a command building a layer using the same instructions as the prior run. It is simply saying that if a command is not the same, then the layer cache invalidation starts at that point, and flows right the way down the Dockerfile. — halfer, Dec 11 '19 at 13:38
@halfer, but the cache is not invalidated; it is just not used for this build, which IMHO is quite different from the common usage of `cache invalidation` — Alex Martian, Dec 11 '19 at 13:42
The cache is not per container, it is per container layer. Each command creates a new layer, that sits on top of the old one. Thus, layers are invalidated from the changed command, and everything after that. Layers from the cache prior to the changed command are still used in the build. — halfer, Dec 11 '19 at 13:43
You can see this in your second build - `Step 2/3 : RUN echo "test1"` is cached, but `Step 3/3 : RUN echo "test1-3"` is not. — halfer, Dec 11 '19 at 13:45
[Related reading](https://stackoverflow.com/questions/31222377/what-are-docker-image-layers). — halfer, Dec 11 '19 at 13:47
@halfer, do you mean there is a cache for alpine:test1 and a separate cache for e.g. alpine:test1-3 (for the same `RUN echo "test1"`)? — Alex Martian, Dec 11 '19 at 13:47
Sort of. It is per layer, but layers do not operate independently. Layer 1 goes into the cache (using a hash), then Layer 1+2, then Layer 1+2+3, etc. Each layer merges with the prior layer, so the layer for `RUN echo "test1"` can only be reused if all the prior commands (layers) are unchanged. — halfer, Dec 11 '19 at 13:49
Put another way, the hash for `RUN echo "test1"` takes the command itself, plus the prior layer's hash into account, which is why it is not independently cacheable. — halfer, Dec 11 '19 at 13:51
Yes, BMitch is one of the most active contributors to the Docker tags on Stack Overflow `:=)`. — halfer, Dec 11 '19 at 13:55

Zeitounator · Answer 1 · 2021-08-21T12:23:12.863

Let's focus on your original problem (regarding apt-get update) to make things easier. The following example is not based on any best practices. It just illustrates the point you are trying to understand.

Suppose you have the following Dockerfile:

FROM ubuntu:18.04

RUN apt-get update
RUN apt-get install -y nginx

You build a first image using docker build -t myimage:latest .

What happens is:

The ubuntu image is pulled if it does not exist locally
A layer is created and cached to run apt-get update
A layer is created and cached to run apt-get install -y nginx

Now suppose you modify your Dockerfile to be:

FROM ubuntu:18.04

RUN apt-get update
RUN apt-get install -y nginx openssl

and you run a build again with the same command as before. What happens is:

There is already an ubuntu image locally, so it will not be pulled (unless you force it with --pull)
A layer was already created for the apt-get update command against the existing local image, so the cached one is used
The next command has changed, so a new layer is created to install nginx and openssl. Because the apt package index was fetched in the preceding layer and that layer comes from the cache, any nginx or openssl version released since then is invisible to this build, and you will install the outdated ones.

Does this help you grasp the concept of cached layers?

In this particular example, the best approach is to do everything in a single layer, making sure you clean up after yourself:

FROM ubuntu:18.04

RUN apt-get update \
    && apt-get install -y nginx openssl \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*
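
To make the difference between the split and the combined RUN forms concrete, here is a minimal sketch in plain shell. It is a toy model, not Docker's real cache algorithm, and it never calls Docker or apt. The names upstream_index, cache_dir, run_step and build are hypothetical, and sha256sum (GNU coreutils) is assumed to be available. Each "layer" is keyed by its parent's key plus the instruction text, and stores the package index version it was built with.

```shell
#!/usr/bin/env bash
# Toy model of layer caching; it never calls Docker or apt.
# upstream_index, cache_dir, run_step and build are hypothetical names.

upstream_index=1           # package index version currently on the mirror
cache_dir="$(mktemp -d)"   # one file per cached layer, holding its index version

# run_step PARENT_KEY INSTRUCTION
# Updates the globals $key (this layer's key) and $index (index version in the layer).
run_step() {
    key="$(printf '%s\n%s' "$1" "$2" | sha256sum | cut -d' ' -f1)"
    if [ -f "$cache_dir/$key" ]; then
        index="$(cat "$cache_dir/$key")"   # cache hit: reuse the stored layer
    else
        case "$2" in
            *"apt-get update"*) index="$upstream_index" ;;   # really refresh the index
        esac
        echo "$index" > "$cache_dir/$key"  # cache miss: store the new layer
    fi
}

# build INSTRUCTION...
# Prints the index version that the final layer was installed from.
build() {
    local instr
    key="ubuntu:18.04"
    index=0
    for instr in "$@"; do
        run_step "$key" "$instr"
    done
    echo "$index"
}

build 'RUN apt-get update' 'RUN apt-get install -y nginx'           # prints 1
upstream_index=2   # newer packages are published on the mirror
build 'RUN apt-get update' 'RUN apt-get install -y nginx openssl'   # prints 1 (stale)
build 'RUN apt-get update && apt-get install -y nginx openssl'      # prints 2
```

In the second build the unchanged apt-get update step is a cache hit, so the install step runs against the old index. In the combined form, any change to the package list changes the whole instruction, so the update is rerun together with the install.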

But then the RUN command for apt-get invalidates all further layers in every build... at least that's what I think is happening to me — JorgeeFG, Feb 18 '22 at 01:55
There's only one layer in my last example (i.e. a single `RUN` command). Once built, cache is used if you don't change the command or parameters that influence it (previous layer, ARG value change...) — Zeitounator, Feb 18 '22 at 07:12
Ok, I understand. I think the problem is elsewhere, as I always get the message "downloaded newer image for phpdockerio/7.4", so I think that's the real problem — JorgeeFG, Feb 18 '22 at 11:49

score 4 · Accepted Answer · answered Dec 11 '19 at 13:48

The line would be better phrased as:

If not, there is a cache miss and the cache is not used for this build step and any following build step of this stage of the Dockerfile.

That gets a bit verbose because a multi-stage Dockerfile can fail to find a cache match in one stage and then find a match in another stage. Different builds can all use the cache. The cache is "invalidated" only for a specific build process; the cache itself is not removed from the Docker host, and it continues to be available for future builds.
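
The distinction between a cache miss and removing entries from the cache can be sketched with a small shell model that replays the four builds from the question. This is only an illustration under assumptions: Docker's real cache keys are computed differently, the names cache_dir and build are hypothetical, and sha256sum (GNU coreutils) is assumed to be available. A layer's key is derived from its parent's key and the instruction text, and a miss only adds an entry.

```shell
#!/usr/bin/env bash
# Toy model only: real Docker keys layers differently. Names are hypothetical.

cache_dir="$(mktemp -d)"   # stands in for the layer cache on the Docker host

# build BASE INSTRUCTION...
# Prints "hit" or "miss" per instruction, like "Using cache" / "Running in".
build() {
    local key="$1" instr
    shift
    for instr in "$@"; do
        # A layer's key depends on its parent's key and on the instruction text.
        key="$(printf '%s\n%s' "$key" "$instr" | sha256sum | cut -d' ' -f1)"
        if [ -f "$cache_dir/$key" ]; then
            echo "hit:  $instr"
        else
            echo "miss: $instr"
            : > "$cache_dir/$key"   # record the new layer; nothing is deleted
        fi
    done
}

build alpine 'RUN echo "test1"' 'RUN echo "test1-2"'   # miss, miss
build alpine 'RUN echo "test1"' 'RUN echo "test1-3"'   # hit,  miss
build alpine 'RUN echo "test2"' 'RUN echo "test1-3"'   # miss, miss
build alpine 'RUN echo "test1"' 'RUN echo "test1-3"'   # hit,  hit
```

The third build misses on both steps, because its second step has a different parent, yet the fourth build still hits on both: the entries written by the second build were never removed.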

BMitch

Yes, I thought so myself, I just wanted to be sure. Is it worth contributing an edit somehow (I've never done that before), or was it maybe only me who was confused? – Alex Martian Dec 11 '19 at 13:50
@AlexeiMartianov I've made a fair number of edits because of questions from others like yourself. If you don't want to, I'll be more than happy to send a PR over with better phrasing. – BMitch Dec 11 '19 at 13:57
I'll be happy to try to do that myself, https://docs.docker.com/opensource/ says one can click edit page to do that. – Alex Martian Dec 11 '19 at 14:04
BMitch, hi! I've made this small edit; almost a month has passed and it is still pending review. Is that normal? (https://github.com/docker/docker.github.io/pull/10011). – Alex Martian Jan 09 '20 at 12:08
@AlexeiMartianov see the other PRs submitted to that repo to see what's normal. – BMitch Jan 09 '20 at 14:21
