
I frequently seem to have to write Dockerfiles like this (line numbers added for clarity):

1. FROM somebase
2. COPY some/local/stuff /some/docker/container/path
3. RUN some-other-local-commands
4. RUN wget http://some.remote.server/some.remote.path.for.example.json
5. RUN some-other-local-commands-which-may-depend-on-the-json

On line (4), I'm fetching a remote resource; let's assume for now that it's a JSON file. It might change from time to time: maybe not on every build, but perhaps every few hours or days.

This means that every time I build my container, I want to ensure the freshest JSON file is fetched. One way to force this is to add the --no-cache flag to my docker build command, but that forces all of the lines/layers to rebuild, including (1)-(3), where it is likely unnecessary. Is there a pattern or technique to automatically 'taint' or 'mark' line (4) so that Docker knows it always has to re-run the wget (presumably this would also force a rebuild of line (5)), whilst still getting the layer caching behaviour for lines (1)-(3) when Docker detects the prerequisite files haven't changed?
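For concreteness, here's roughly what the two build invocations look like today (the myimage tag is just a placeholder):

docker build -t myimage .              # caches everything, including possibly-stale JSON from line (4)
docker build --no-cache -t myimage .   # fresh JSON, but also needlessly rebuilds lines (1)-(3)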

Andrew Ferrier

1 Answer


If the specific thing you want to trigger rebuilds is the result of a RUN wget against a specific URL, Docker actually does have native support for this.

There are two similar commands to copy files into an image. COPY only copies files from the build context. ADD can also fetch external URLs and unpack local archives (but not both at the same time). The general recommendation is to use COPY unless you need one of the specific things ADD does differently.

So you should be able to say

ADD http://some.remote.server/some.remote.path.for.example.json .
RUN some-other-local-commands-which-may-depend-on-the-json

and the RUN command will use the Docker layer cache based on the contents of the fetched file. Docker downloads the URL on every build to checksum it, so the ADD layer (and the RUN after it) only rebuilds when the remote file has actually changed.

If this approach doesn't work for you (maybe you need special authentication to fetch the file), you can also fetch the file outside of Docker before you run docker build, and then COPY it in. It will behave like any other file you COPY in: layer caching takes effect based on whether the file has changed or not.
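As a minimal sketch of that second approach (the example.json and myimage names are just placeholders):

# fetch the file on the host first, so the Dockerfile only needs COPY
wget -O example.json http://some.remote.server/some.remote.path.for.example.json
docker build -t myimage .

and in the Dockerfile, in place of the RUN wget line:

COPY example.json .
RUN some-other-local-commands-which-may-depend-on-the-json

Since COPY's cache key is the file's content, the earlier layers still come from the cache whenever the downloaded file is unchanged.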

David Maze
  • David, thanks for your answer. I think this is "wrong", though, for two specific reasons. Firstly, you're right that `ADD` will fetch a remote file in the `wget` case, but my use of `wget` was only intended as an example of an external network action; it could be a `curl` POST, for example, or something completely different. Secondly, your answer is the opposite of what (hopefully!) I was asking for: I want `ADD` *not* to use the layer cache. Otherwise, `RUN wget` will do the same job. – Andrew Ferrier Apr 29 '21 at 14:21
  • Sorry, my apologies, I re-read your answer. I understand it now. `ADD` would do the job in my case, yes. But I am looking for a more generic answer for network-based RUN lines (or even non-network-based). – Andrew Ferrier Apr 29 '21 at 14:23
  • For the specific case of fetching an external file by URL, you can use `ADD`. There's not a more general solution, though; the question @jonrsharpe links to describes a similar `ADD`-based workaround for `git clone`, but there's not a way to tell Docker "always try to `RUN` this specific step, but maybe go back to the cache if nothing is changed". – David Maze Apr 29 '21 at 14:33
  • The linked answer has some hacks that are acceptable to me (e.g. the line which calls random.org; see the sketch after these comments), but I appreciate that's a hack. Thanks anyway for your input. I'll mark your answer as correct because it does answer the narrow interpretation of my question. – Andrew Ferrier Apr 29 '21 at 14:37
  • This will never work for git clone ... – user3613987 May 31 '22 at 18:15
  • That's true. For a variety of reasons, some related to layer caching, I generally recommend running `git clone` outside of Docker. – David Maze May 31 '22 at 19:55
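For reference, the random.org cache-busting hack mentioned in the comments looks something like this (the skipcache file name is arbitrary, and any URL that returns different content on every request would do):

FROM somebase
COPY some/local/stuff /some/docker/container/path
RUN some-other-local-commands
# random.org returns fresh random bytes on every request, so this ADD's
# checksum differs on every build, invalidating the cache for this layer
# and every layer after it
ADD "https://www.random.org/cgi-bin/randbyte?nbytes=8&format=h" skipcache
RUN wget http://some.remote.server/some.remote.path.for.example.json
RUN some-other-local-commands-which-may-depend-on-the-json

Note that this re-runs the wget on every single build, even when the remote JSON hasn't changed, which is why it's a hack rather than a real solution.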