Best practice to cache downloaded resources between builds

Question

I am building a web application I'd like to deploy as a Docker container. The application depends on a set of assets stored in a separate Git repository. The reason for using a separate repository is that the history of that repository is much larger than the current checkout and we'd like to have a way to throw away that history without touching the history of the repository containing the source code.

In the example below, containing only the relevant parts, I'm passing the assets repository commit ID into the build process using a file:

FROM something:something

# [install Git and stuff]

COPY ["assets_git_id", "/root/"]
RUN git clone --bare git://lala/assets.git /root/assets.git \
    && mkdir -p /srv/app/assets \
    && git --git-dir=/root/assets.git --work-tree=/srv/app/assets checkout "$(cat /root/assets_git_id)" . \
    && rm -r /root/assets.git

# [set up the rest of the application]

The problem here is that whenever that ID changes, the whole repository is cloned during the build process and most of the data is thrown away.

What is the canonical way to reduce the wasted resources in such a case? Ideally I'd like to have access to a directory from inside the container during the build whose contents are kept between multiple runs of the same build. The RUN script could then just update the repository and copy the relevant data from it instead of cloning the whole repository each time.
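As a rough sketch of what such a RUN script could do: the hypothetical helper `update_assets` below clones into a cache directory only when that directory is empty, otherwise fetches just the new objects, and then checks the requested commit out into the target directory. The function name and its arguments are made up for illustration, and the sketch assumes the cache directory already persists between builds; how to achieve that is exactly the open question.

```shell
#!/bin/sh
# Hypothetical helper: update_assets CACHE_DIR REPO_URL COMMIT_ID TARGET_DIR
# Assumption: CACHE_DIR survives between builds (not solved by this sketch).
update_assets() {
    cache_dir=$1
    repo_url=$2
    commit_id=$3
    target_dir=$4

    if [ -d "$cache_dir/objects" ]; then
        # Cache hit: download only the objects the cache does not have yet.
        git --git-dir="$cache_dir" fetch --quiet "$repo_url" \
            '+refs/heads/*:refs/heads/*' || return 1
    else
        # Cache miss: one full bare clone that later builds can reuse.
        git clone --quiet --bare "$repo_url" "$cache_dir" || return 1
    fi

    mkdir -p "$target_dir" || return 1
    # Write the files of the requested commit into the target directory;
    # the cached repository itself stays bare.
    git --git-dir="$cache_dir" --work-tree="$target_dir" \
        checkout --quiet "$commit_id" -- .
}
```

Inside the build it might be called as `update_assets /cache/assets.git git://lala/assets.git "$(cat /root/assets_git_id)" /srv/app/assets`, where `/cache/assets.git` is a placeholder for the persistent directory.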

If you just want the files of a specific commit, then you should detach those files from Git. Export them into an archive, extract it, and delete the archive. See here: https://stackoverflow.com/questions/11018411/how-do-i-export-a-specific-commit-with-git-archive – blacklabelops, Oct 01 '17 at 10:10
Why are you cloning the full repo? Why don't you use something like `git clone --depth 1 -b <branch> git://lala/assets.git`? This will limit the history. – Tarun Lalwani, Oct 05 '17 at 06:55
@TarunLalwani That would only work if I created a new branch each time I want to update the image and never re-used the branch names or pushed new commits to those branches. Otherwise Docker will use a cached image and defeat the purpose. Git can't create a shallow clone from a commit ID. – Feuermurmel, Oct 07 '17 at 14:15
Then you can try something like https://github.com/grammarly/rocker – Tarun Lalwani, Oct 07 '17 at 16:11
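
The `git archive` suggestion from the first comment could look roughly like the sketch below. The helper name `export_commit` and its arguments are hypothetical. It pipes the archive straight into `tar`, so no intermediate archive file has to be created and deleted. It still needs a repository that already contains the commit, so it does not by itself avoid the clone.

```shell
#!/bin/sh
# Hypothetical helper: export_commit GIT_DIR COMMIT_ID TARGET_DIR
# Unpacks the files of one commit into TARGET_DIR without any Git metadata.
export_commit() {
    git_dir=$1
    commit_id=$2
    target_dir=$3

    mkdir -p "$target_dir" || return 1
    # git archive writes the commit's tree as a tar stream to stdout;
    # tar reads that stream from stdin and extracts it into the target.
    git --git-dir="$git_dir" archive --format=tar "$commit_id" \
        | tar -x -f - -C "$target_dir"
}
```

With the paths from the question, a call might be `export_commit /root/assets.git "$(cat /root/assets_git_id)" /srv/app/assets`.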

Robert · Answer 1 · 2017-09-30T21:43:12.947

Use multi-stage builds:

# Stage 1
FROM something:something as GitSource

# [install Git]

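# Cached layer: the full clone is only repeated when this layer is invalidated.
RUN git clone --bare git://lala/assets.git /root/assets.git
COPY ["assets_git_id", "/root/"]
# A bare repository has no work tree to pull into, so fetch the new commits instead.
RUN git --git-dir=/root/assets.git fetch origin '+refs/heads/*:refs/heads/*'

RUN mkdir -p /srv/app/assets
RUN git --git-dir=/root/assets.git --work-tree=/srv/app/assets checkout "$(cat /root/assets_git_id)" .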

# Stage 2
FROM something:something

COPY --from=GitSource /srv/app/assets /srv/app/assets
# [set up the rest of the application]

The final image discards everything done in Stage 1 except what is copied into Stage 2.

edited Sep 30 '17 at 21:43

answered Sep 30 '17 at 20:53


Hi! I am familiar with multi-stage builds. But I don't see how that would improve anything. AFAICT this will still clone the whole repository every time `assets_git_id` changes and create an image with the same size. What am I missing? – Feuermurmel Sep 30 '17 at 21:06
Thank you for trying but this solution just delays the inevitable. The repository in the cached image created in the `RUN git clone ...` step will get more and more outdated and the `RUN git --git-dir=... pull` step has to download more stuff over and over again. – Feuermurmel Sep 30 '17 at 21:54
It's the best approach that I can imagine. The Docker daemon only has the cached layers and the context that you send to it. The Git data must either be there or be fetched from the internet. From time to time you can discard the cached git-clone layer to update it. – Robert Sep 30 '17 at 22:08
