
Assume I have 2 options for adding Docker layers.

Option 1:

RUN python -m nltk.downloader punkt averaged_perceptron_tagger brown

Option 2:

RUN python -m nltk.downloader punkt 
RUN python -m nltk.downloader brown
RUN python -m nltk.downloader averaged_perceptron_tagger 

I understand that the 2nd option adds 3 layers, whereas the 1st option adds only 1 layer.

Does the number of layers have implications for the size, setup time, or performance of current and future Docker images?

Note: "Current" means the current image. "Future" means any image that may reuse some of the layers from an existing image, thus speeding up its setup.

variable

2 Answers


It does impact setup time and might impact size. It should not impact performance, where by performance I mean the actual runtime performance of the app.

It does impact setup time because the better you define your layers, the more reusable they are across images. At the end of the day, a layer is just a cache entry: if it can be shared with other images, build time improves.
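As a concrete illustration (the file paths and package choices here are hypothetical): if two Dockerfiles begin with the same base image and the same initial commands, the layers built for one can be reused verbatim when building the other on the same machine:

# service-a/Dockerfile and service-b/Dockerfile both start with
# identical steps, so their initial layers are shared via the cache
FROM python:3.10-slim
RUN pip install --no-cache-dir nltk
RUN python -m nltk.downloader punkt
# ...the two Dockerfiles diverge only after this point...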

Regarding size, it really depends on how you build the image. For example, if you have build dependencies that are not needed at runtime, the image will be bigger because it still carries those dependencies. Speaking of Python, you will usually want to install build-essential to build your app; however, once the packages are installed, you no longer need build-essential. If you don't remove it, the image will be bigger.

To remove it, you have two options:

  • Use a single long RUN statement in which you install build-essential, install the packages you need, and then remove build-essential, all in the same RUN.
  • Use a multi-stage build, with separate stages for building and running.
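A minimal sketch of both approaches, assuming a hypothetical Python app with a requirements.txt (the base image tag and package names are illustrative):

# Approach 1: install, build, and clean up inside one RUN, so
# build-essential never survives into a committed layer
FROM python:3.10-slim
COPY requirements.txt .
RUN apt-get update \
 && apt-get install -y --no-install-recommends build-essential \
 && pip install --no-cache-dir -r requirements.txt \
 && apt-get purge -y build-essential && apt-get autoremove -y \
 && rm -rf /var/lib/apt/lists/*

# Approach 2: multi-stage build; only the installed packages are
# copied into the final image, and build-essential stays behind
FROM python:3.10-slim AS build
RUN apt-get update && apt-get install -y --no-install-recommends build-essential
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

FROM python:3.10-slim
COPY --from=build /install /usr/local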
eez0

At a practical level, and especially at the level of individual layers, it makes no difference. There used to be a documented limit of 127 layers in an image; most practical images have fewer than 20. In principle, going through more Docker filesystem layers could make reads slower, but Linux kernel filesystem caching applies, and for most performance-sensitive work, avoiding going to disk at all is usually best.

As always with performance considerations, if it really matters to you, measure it in the context of your specific application.
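For example, docker history shows how much each layer contributes to the final image, and docker image inspect can report the layer count, which makes the two variants easy to compare directly (the image tag myapp:latest is hypothetical):

# per-layer sizes for the built image
docker history myapp:latest

# number of filesystem layers in the image
docker image inspect myapp:latest --format '{{len .RootFS.Layers}}'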

I'd say there are really three things to keep in mind about Docker image layers:

  1. Adding a layer never makes an image smaller. If you install something in an earlier RUN step and remove it in a later RUN step, your image winds up with all of the installed content from the earlier layer, plus an additional placeholder layer that says "this stuff is deleted now". This particularly happens around build tools; @eez0 discusses this case a little more in their answer. (See the sketch after this list.)

  2. Layers are the unit of Docker image caching. If you repeat a docker build and a step runs an identical command on top of an identical parent layer, Docker will skip actually running it and reuse the resulting layer from the previous build. This has a couple of impacts on Dockerfile style (you always want to RUN apt-get update && apt-get install in the same command, in case the list of packages to install changes) but doesn't really impact performance.

  3. You can docker run the result of an individual step. This is a useful debugging technique: with the classic (pre-BuildKit) builder, the output of docker build includes an image ID for each step, and you can docker run those intermediate results. If some step is failing, you can get a debugging shell on the image as of the start of that step and see what's actually in the filesystem.
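To make points 1 and 3 concrete, here is a small sketch (the package choice is illustrative):

# Point 1: the second RUN does not shrink the image. Everything the
# first RUN installed is still stored in its layer; the deletion only
# adds a small "whiteout" layer on top of it.
FROM python:3.10-slim
RUN apt-get update && apt-get install -y build-essential
RUN apt-get purge -y build-essential && apt-get autoremove -y

# Point 3 (classic builder): each step prints an intermediate image
# ID, and you can start a shell there to poke around
docker build .
docker run --rm -it <intermediate-image-id> /bin/sh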

In your example, the two questions worth asking are how much overhead an individual downloader step has, and how likely this set of plugins is to change. If you're likely to change things often, separate RUN commands will let you cache layers better on later builds; if the downloader tool itself has high overhead (maybe it assembles all downloaded plugins into a single zip file) then running it just once could conceivably be faster.
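For instance, with separate RUN lines, appending a new corpus later only builds the new layer; the earlier downloads come straight from the cache (wordnet here is just an illustrative addition):

RUN python -m nltk.downloader punkt                        # cached on rebuild
RUN python -m nltk.downloader brown                        # cached on rebuild
RUN python -m nltk.downloader averaged_perceptron_tagger   # cached on rebuild
RUN python -m nltk.downloader wordnet                      # new line: only this layer is built

With the single combined RUN of Option 1, the same change would re-run all of the downloads.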

Usual practice is to try to pack things together into a single RUN command, but it's kind of a micro-optimization that usually won't make much of a difference in practice. In the case of package installations, I'm used to seeing just one apt-get install or pip install line in a Dockerfile, and style-wise I might expect that here too. If you're in the process of developing things, one command per RUN line is easier to understand and debug.

David Maze