10

I'm trying to generate and re-use a yarn install cache when building a Docker image with Docker BuildKit. The yarn cache is persisted in the .yarn/cache directory (relative to the build context root) and should never be included in the final image. The .yarn/cache directory should be shared among multiple builds so that yarn install always starts from a warm cache and runs fast, even when a change in package.json causes a cache miss. If we could access the .yarn/cache content after docker build ends, it would be easy to share it between builds, for example by uploading it to an Amazon S3 or GCS bucket.

I've considered two options:

  1. RUN --mount=type=bind
  2. RUN --mount=type=cache

Below I describe why neither of the two methods works.

(1) RUN --mount=type=bind

The (simplified) Dockerfile looks like this:

ENV YARN_CACHE_FOLDER ".yarn/cache"
COPY package.json yarn.lock ./
RUN --mount=type=bind,source=.yarn/cache,target=.yarn/cache,rw yarn install --frozen-lockfile

Unfortunately, no data is present in the .yarn/cache directory after the docker build command ends.

The reason no data is persisted is described in the documentation for the rw option: "Allow writes on the mount. Written data will be discarded." If the written data is discarded, what is a working method for generating the cache the first time?
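
For illustration, here is a minimal reproduction of that behaviour (assuming BuildKit is enabled and an empty out directory exists in the build context):

# Dockerfile.repro (illustrative only)
FROM alpine
# the write succeeds inside the build step...
RUN --mount=type=bind,source=out,target=/out,rw touch /out/hello
# ...but after `docker build -f Dockerfile.repro .` finishes,
# ./out on the host is still empty: the written data was discarded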

(2) RUN --mount=type=cache

Alternatively, I considered using RUN --mount=type=cache. Unfortunately there doesn't seem to be an easy way to persist the cache in a local directory on the build host so that it can be saved to an Amazon S3 or GCS bucket. If the cache is not persisted, we can't reuse it across different Cloud Builds, because the Docker daemon state is not shared between them.
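
For reference, the cache-mount variant looks roughly like this (the id and the cache path are just illustrative):

# syntax=docker/dockerfile:1
FROM node:16
WORKDIR /app
# point yarn at a directory backed by a BuildKit cache mount
ENV YARN_CACHE_FOLDER=/root/.yarn-cache
COPY package.json yarn.lock ./
RUN --mount=type=cache,id=yarn,target=/root/.yarn-cache \
    yarn install --frozen-lockfile

This keeps the cache out of the final image and works well on a single machine, but the cache lives inside BuildKit's local state rather than in a directory I can upload after the build.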

To put it another way: what is the best method for sharing a cache directory between different docker build invocations running on different machines, without including that cache in the image? Is there any other way I'm missing here? To recap:

  1. RUN --mount=type=bind: lets me mount a directory from the build context, but written data is discarded, so I can't generate the cache on the first run.
  2. RUN --mount=type=cache: lets me share the cache between multiple builds on the same machine, but it doesn't help when running docker build on different machines, because each machine starts with an empty cache.
  • Can you share your cloudbuild.yaml file? Or explain how you want to reuse the cache? In the same build? Between different builds? At runtime elsewhere? – guillaume blaquiere Mar 22 '22 at 07:52
  • The cloudbuild.yaml contains a single docker build command plus commands to download/upload the cache (a local directory under /workspace) from/to an S3 or GCS bucket. I want a warm cache when I trigger the build multiple times; the builds will run on different machines. – Gianluca Venturini Mar 22 '22 at 18:41
  • I'm confused.... I still don't understand. Do you want the cache when you run the container or when you build it? In your dockerfile,.... no I don't understand. It's like a `docker run` but inside a `RUN` statement in the Dockerfile. I'm totally lost, or too bad in Dockerfile – guillaume blaquiere Mar 22 '22 at 20:48
  • I'm interested in mounting the cache during `docker build`. As you can see from the documentation that I linked in the question (https://hub.docker.com/r/docker/dockerfile) using Docker BuildKit you can use the `--mount` syntax in a Dockerfile `RUN` command, but it seems somewhat limited because you can't really "write" the result in the directory that is mounted. – Gianluca Venturini Mar 23 '22 at 03:21
  • Great, I learnt something! So, two things: first, try to use an absolute path, /workspace/.yarn/cache. Second, it will only work if you perform several docker build commands in the same Cloud Build step and in the same Cloud Build execution. Otherwise the data will disappear (that's not exactly true: you can back it up to Cloud Storage if you want, as detailed [here](https://cloud.google.com/build/docs/building/store-build-artifacts#uploading_files_and_folders)) – guillaume blaquiere Mar 23 '22 at 09:56
  • Thank you for explaining more about `--mount type=cache`! I've been wondering what happens with cloud CI builds when using that flag, but the documentation never covered the more practical aspects of using it. – Jules Aug 14 '22 at 06:32

3 Answers

0

If you are using the standard node base image, e.g. node:16, you can accomplish what you want with something like this:

FROM node:16
WORKDIR /app
COPY package.json .
COPY yarn.lock .
RUN --mount=type=cache,target=/usr/local/share/.cache/yarn \
    yarn install --frozen-lockfile --ignore-scripts --production
COPY . .
RUN yarn build
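
If BuildKit is not already the default builder on your Docker version, the build can be invoked like this (the image tag is just an example), and on older versions you may also need a # syntax=docker/dockerfile:1 line at the top of the Dockerfile to enable the --mount flag:

DOCKER_BUILDKIT=1 docker build -t myapp:latest .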
Charles Santos
0

Based on your use of docker build in your question, I'd recommend looking at the output of docker build --help. There are --cache-from and --cache-to flags you can pass on the command line to tell docker build where to read and write its cache.
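
As a sketch, with docker buildx that could look like the following (the registry reference is just a placeholder):

docker buildx build \
  --cache-from type=registry,ref=registry.example.com/myapp:buildcache \
  --cache-to type=registry,ref=registry.example.com/myapp:buildcache,mode=max \
  -t myapp .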

Speeddymon
0

RUN --mount=type=cache is the correct approach here because, as you already discovered, rw access to the cache is necessary for it to be of any use across builds. Additionally, --cache-from and --cache-to explicitly do not include these types of cache mounts, so your cache will not be persisted across CI runs that way.

What we therefore need are pre- and post-build steps that pull/push the contents of the cache mount from/to S3 before and after each run. You can achieve this "cache dance" as follows:

date --iso=ns | tee scratchdir/buildstamp
docker buildx build -f scratchdir/Dancefile.inject scratchdir
<run your buildx build here>
date --iso=ns | tee scratchdir/buildstamp
docker buildx build -f scratchdir/Dancefile.extract scratchdir

The timestamp in the scratchdir is necessary to bust the layer caching carried out by Docker. Create the inject/extract Dockerfiles in the scratchdir as well, and adjust them to suit your use case by adding more cache mounts and sync commands for each cache directory you want to sync to S3. The following example demonstrates transferring the .yarn/cache directory:

# Dancefile.inject
FROM peakcom/s5cmd:v2.0.0
COPY buildstamp buildstamp
RUN --mount=type=cache,sharing=shared,id=yarn,target=/builddir/.yarn/cache \
    /s5cmd sync s3://cache-bucket/yarn/* /builddir/.yarn/cache

# Dancefile.extract
FROM peakcom/s5cmd:v2.0.0
COPY buildstamp buildstamp
RUN --mount=type=cache,sharing=shared,id=yarn,target=/builddir/.yarn/cache \
    /s5cmd sync /builddir/.yarn/cache/* s3://cache-bucket/yarn/

With this process, the cache mount directories are populated from S3 before the build and pushed back afterwards, and they are available to any other BuildKit build on the same runner that uses the same cache mount id. You can freely adjust the location of the cache mount in your build Dockerfile, since cache mounts are identified by their id, not their mount point.
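
For illustration, the application Dockerfile could then use the same cache id like this (base image and paths are just examples; only the id has to match):

FROM node:16
WORKDIR /app
# yarn writes its cache into the mounted directory, keyed by id=yarn
ENV YARN_CACHE_FOLDER=/app/.yarn/cache
COPY package.json yarn.lock ./
RUN --mount=type=cache,sharing=shared,id=yarn,target=/app/.yarn/cache \
    yarn install --frozen-lockfile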

Further reading:

  • the original post, which deserves full credit for breaking this problem apart and for the idea behind the solution
  • a GitHub Action implementing the same approach
  • my fork of the above GitHub Action, with the goal of supporting a specific Rust/Yarn build, improving performance by reducing the number of copy operations, and using s5cmd as shown above to further accelerate transfers
strophy
  • Interesting. If you have a "cache" image around, how is it better than simply copying the desired files from image w/ temporary container? Is this approach better in terms of I/O? – Cyclone Jul 17 '23 at 20:19
  • The cache is not stored in an image, it is stored in S3 as the actual raw contents of the cached directories. It needs to be loaded through a build command in order to populate the Buildkit cache, which is empty when a new ephemeral runner comes up. – strophy Jul 20 '23 at 02:38