Assume BuildKit is enabled with export DOCKER_BUILDKIT=1.

Take main.py:

i = 0

while True:
    i += 1

Take this Dockerfile:

FROM python:3.9-slim as base
COPY main.py .

FROM base as part_1
RUN echo "A" && python -m main

FROM base as part_2
RUN echo "B" && python -m main

FROM base as combined
COPY --from=part_1 . .
COPY --from=part_2 . .

Running docker build --no-cache . and then top shows the build parallelized across two cores, as expected from BuildKit:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND              
  22569 root      20   0   14032  11620   4948 R 100.0   0.0   0:10.43 python               
  22571 root      20   0   14032  11620   4948 R 100.0   0.0   0:10.34 python                            

But removing the echo commands from the Dockerfile:

FROM python:3.9-slim as base
COPY main.py .

FROM base as part_1
RUN python -m main

FROM base as part_2
RUN python -m main

FROM base as combined
COPY --from=part_1 . .
COPY --from=part_2 . .

and rerunning docker build --no-cache . followed by top shows the build using only one core (the second process listed is unrelated), which is unexpected from BuildKit:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND  
  24674 root      20   0   14032  11624   4952 R 100.0   0.0   1:00.40 python   
   2485 mishac    20   0 5824548 515428 126120 S  12.3   1.6   2:52.74 gnome-s+ 

Why does removing the echo commands disable the parallelization? It seems like an odd thing to affect it. Is it possible to keep the parallelization without them?

Version:

$ docker --version
Docker version 20.10.16, build aa7e414
Mario Ishac

1 Answer

BuildKit uses a low-level build definition format (LLB) to compute a content-addressable dependency graph. This lets it optimize the build by directly tracking the checksums of the nodes in that graph. All stages are analyzed before any processing is done.

Since you are starting from the same base image and executing the same RUN command in each stage, BuildKit determines that both stages will produce the same output and performs the operation only once.

When you add the echo command, you introduce a difference in the dependency graph, so BuildKit builds two separate stages, which it does in parallel as you expect. If you RUN a different script or COPY some unique file(s) in each stage, they will build in parallel. Even just setting a unique ENV in each stage is enough to trigger this.
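The deduplication described above can be illustrated with a toy model. This is only a sketch of the idea of content addressing, not BuildKit's actual digest computation: the vertex_digest helper and the strings used to stand in for each instruction are hypothetical.

```python
import hashlib


def vertex_digest(parent: str, op: str) -> str:
    """Toy digest: hash of the parent digest plus the instruction text.

    Hypothetical helper; real BuildKit digests are computed over LLB
    op definitions, not over Dockerfile text.
    """
    return hashlib.sha256(f"{parent}\n{op}".encode()).hexdigest()


# Stand-in digest for the shared `base` stage.
base = vertex_digest("", "FROM python:3.9-slim; COPY main.py .")

# Same parent and same RUN text: both stages collapse into one vertex.
part_1 = vertex_digest(base, "RUN python -m main")
part_2 = vertex_digest(base, "RUN python -m main")
same = part_1 == part_2

# Different RUN text: two distinct vertices, free to run in parallel.
part_1_echo = vertex_digest(base, 'RUN echo "A" && python -m main')
part_2_echo = vertex_digest(base, 'RUN echo "B" && python -m main')
distinct = part_1_echo != part_2_echo

print(same, distinct)
```

In this model the two stages without echo share a single digest, so there is only one unit of work to schedule, while the echo variants hash differently and become two independent units.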

Below is a minimal test that demonstrates this behavior (using alpine as the base image, which is only around 5.5 MB):

test.sh

#!/bin/sh
sleep 10
touch /test

Dockerfile A

FROM alpine AS base
WORKDIR /run
COPY ./test.sh .

FROM base AS first
RUN /run/test.sh

FROM base AS second
RUN /run/test.sh

FROM base AS output
COPY --from=first /test .
COPY --from=second /test .

Command

sudo DOCKER_BUILDKIT=1 docker build --no-cache .

Output

[Screenshot of the build output for Dockerfile A]

You can see that the first stage is skipped, and the second stage took just over 10 seconds to complete. Yet the COPY command in the output stage has no trouble reading from the first stage, because both stage names refer to the same build result.

Now, if we add an ENV with a unique value in each stage...

Dockerfile B

FROM alpine AS base
WORKDIR /run
COPY ./test.sh .

FROM base AS first
ENV test=A
RUN /run/test.sh

FROM base AS second
ENV test=B
RUN /run/test.sh

FROM base AS output
COPY --from=first /test .
COPY --from=second /test .

Both stages are built in parallel:

[Screenshot of the build output for Dockerfile B]
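The difference between Dockerfile A and Dockerfile B can be sketched with a small scheduler model. It is a simplification under stated assumptions: the digest, plan and build functions are hypothetical, ENV is treated as its own step in the chain, and a thread pool stands in for BuildKit's solver.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def digest(parent, op):
    """Toy content address: hash of parent digest plus instruction text."""
    return hashlib.sha256(f"{parent}\n{op}".encode()).hexdigest()


def plan(stages):
    """Map each stage name to the digest of its last instruction."""
    base = digest("", "FROM alpine; WORKDIR /run; COPY ./test.sh .")
    targets = {}
    for name, ops in stages.items():
        current = base
        for op in ops:
            current = digest(current, op)
        targets[name] = current
    return targets


def fake_exec(vertex):
    """Stand-in for running a vertex; returns a label for its result."""
    return f"result-{vertex}"


def build(stages):
    """Run each unique vertex once, then resolve every stage name."""
    targets = plan(stages)
    unique = sorted(set(targets.values()))
    with ThreadPoolExecutor() as pool:
        results = dict(zip(unique, pool.map(fake_exec, unique)))
    return len(unique), {name: results[d] for name, d in targets.items()}


dockerfile_a = {
    "first": ["RUN /run/test.sh"],
    "second": ["RUN /run/test.sh"],
}
dockerfile_b = {
    "first": ["ENV test=A", "RUN /run/test.sh"],
    "second": ["ENV test=B", "RUN /run/test.sh"],
}

runs_a, outputs_a = build(dockerfile_a)
runs_b, outputs_b = build(dockerfile_b)
print(runs_a, runs_b)
```

In this model Dockerfile A yields one execution whose result is visible under both stage names, which matches COPY --from=first still working, while Dockerfile B yields two independent executions.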

Besworks