
I am embedding a rather large (multi-GB) database file into an image with a standard Go build. This is a read-only database that we create in a separate process then run on our k8s cluster. The file cannot be on a mounted volume for business reasons that may or may not be valid, but it's currently a constraint I need to respect.

I would like to speed up the docker build by using cache when possible, and avoid a re-push of the database when it has not changed (but code has).

The current build copies the file after the code compile, meaning any change to the code invalidates the layer with the DB and forces a pull, move to context, and push whenever we change code. This makes for a very long build. So I want to add the DB to the image before the build.

The existing build does a standard Go build with one bit I can't easily change:

ENV CGO_ENABLED=1
WORKDIR /src
COPY ./go.mod ./
COPY ./go.sum ./
RUN go mod download

COPY . ./       

ARG version="dev"
RUN go install -mod=readonly -ldflags="-s -w" -tags netcgo .

COPY /tmp/my-big-database-file /data

I cannot seem to get the db file where I want in the resulting image. When I use COPY I end up with the file in two places.

I want the DB file in the image as a layer before the code. The DB does not change often but the code does, so the layer with the DB would usually be cached, and I could avoid pushing that layer when the DB file hasn't changed.

I can't figure out how to achieve this, given some constraints with the current build setup.

The COPY . ./ part is the issue: it copies that my-big-database-file from /data to /src/data along with all the other files, leaving me with 2 copies of the DB.

Is there a way that I can exclude, remove or otherwise end up with a single copy of my-big-database-file living in the /data directory?

I have tried:

  • .dockerignore which does exactly what one would expect, which is to entirely exclude the file from the image. I need the file in the image; the problem is that I need only one instance (it's a huge file) and it needs to be in a specific location.

  • Remove the COPY /tmp/my-big-database-file /data/my-big-database-file - I do end up with just one file in /tmp, not in the place the code expects.

  • Put the DB in a folder excluded by .dockerignore but then it's completely missed.

  • A multi-stage build. Same result as above. Multi-stage with COPY --from in the go build stage results in 2 copies of the file.

  • COPY --link but I don't think that's what it solves.

  • RUN rm /tmp/my-big-database-file but of course that just adds another layer and doesn't accomplish anything.

It's not immediately practical for me to reorganize the source files (e.g. put them all under /src in git). There are a number of files at the root that I need for subsequent phases of the larger build. Similarly, there are many files and directories at the root level, and selectively copying would work, but creates a fragile dependency.

jonrsharpe
Tom Harrison
  • Does this answer your question? [COPY with docker but with exclusion](https://stackoverflow.com/questions/43747776/copy-with-docker-but-with-exclusion) – jonrsharpe May 12 '23 at 11:09
  • Are you running `docker build` from `/` on the host? And... is this database read-only or something? How do people use it at all if its inside the container where changes will be lost? – Jeffrey Mixon May 12 '23 at 11:26
  • Could you use a [volume](https://docs.docker.com/storage/volumes/) to avoid the copying or do you have to copy everything for a reason? – Peter Krebs May 12 '23 at 11:38
  • Thanks @jonrsharpe, updated question as to why .dockerignore may not be viable in my case. jeffrey-mixon yes, docker build from / -- would need to make substantial changes in our build pipeline to do otherwise. It's an R/O database. peter-krebs, unfortunately volume will not work in our case due to business constraints. – Tom Harrison May 12 '23 at 13:52
  • That does have other answers too, though, and is also the place _new_ solutions should go. Given current Docker functionality, selectively copying what you _do_ want is about as good as you can get. – jonrsharpe May 12 '23 at 13:56

2 Answers


The solution to the specific problem I was having was to split the Docker part into two:

  1. A simple Dockerfile that starts from the usual base image and COPYs in the two data files; that image is built and pushed on its own.
  2. The usual golang Dockerfile, except that its base image is the one from the first step.

This actually made more sense anyway because the code that updates the data files is part of a separate process, distinct from the code that runs and reads from them. Once the files were created, we made an API call to our build system, which had all the setup needed to manage container creation and management.
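
The two steps could look roughly like the sketch below. This is an illustration under assumptions, not the exact files we used: the file name Dockerfile.data, the golang:1.20 base, and the image name registry.example.com/myapp-data are all hypothetical placeholders.

```dockerfile
# Dockerfile.data (hypothetical name): built and pushed only when the database changes.
# golang:1.20 is an assumed base; use whatever the Go build normally starts from.
FROM golang:1.20
# The build context for this image only needs to contain the data file(s).
COPY my-big-database-file /data/my-big-database-file
```

```dockerfile
# Dockerfile for the application: the usual Go build, on top of the data image.
# registry.example.com/myapp-data:latest is a hypothetical tag for the image above.
FROM registry.example.com/myapp-data:latest
ENV CGO_ENABLED=1
WORKDIR /src
COPY ./go.mod ./go.sum ./
RUN go mod download

COPY . ./

ARG version="dev"
RUN go install -mod=readonly -ldflags="-s -w" -tags netcgo .
```

Because the database already lives in the base image, it no longer has to be in the application's build context at all, so COPY . ./ has nothing to duplicate and a .dockerignore entry for the file becomes harmless. A code change then rebuilds and pushes only the layers above the data image.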

Tom Harrison

A multi-stage build seems like it can be a good solution for you here. Since Go is a compiled language, your final image doesn't need to include the Go toolchain, and omitting it can make the image significantly smaller. Of particular note for your question, the only thing docker push will push is the final image, so as long as you COPY things into the final image in the right order you should be able to get the behavior you want.

Your existing Dockerfile can pretty much be the first build stage as-is. Splitting out the go mod download step is important for layer-caching reasons, but you don't need to do anything special about the large database file.

FROM golang:1.20 AS build
WORKDIR /src
...
COPY . ./       
...
RUN go install -mod=readonly -ldflags="-s -w" -tags netcgo .

In the final stage, first copy the database file, and second copy the application itself. That will get you the behavior you want when you push the image. Continuing in the same Dockerfile:

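FROM ubuntu:22.04
# First copy the database file (the trailing slash makes /data a directory)
COPY --from=build /src/data/my-big-database-file /data/
# Second copy the binary `go install .` built; it is written to $GOPATH/bin,
# which is /go/bin in the golang image ("myapp" stands in for the real binary name)
COPY --from=build /go/bin/myapp /usr/local/bin/
# Set up normal metadata to run a container
CMD ["myapp"]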

You must include the database file in the build context for it to be copied into the image; you cannot avoid that bit of slowness. But with this sequence you will avoid re-downloading Go modules unless go.mod or go.sum change. If the application code or modules change, you will rebuild, but the resulting image will reuse the base image plus the database file layer, and you'll only push the final layer with the application binary. If the database data changes, you wind up rebuilding and pushing everything.

David Maze
  • This was the path I tried in multiple different attempts -- my first attempt was just to get the big files handled before any other so they would be their own layer and not get re-pulled, pushed, etc at every build. But the core realization is that there were two cases: 1) needed to save the files after they were created in another process, 2) needed to build the code that used them. So step one was to create an image with the files, step 2 was to use that as a base for the code that used them. Separating concerns did the trick. – Tom Harrison Jun 01 '23 at 22:28