9

Say you have the following list of R packages that you would like to install in a Docker image:

("jsonlite","dplyr","stringr","tidyr","lubridate",
"knitr","purrr","tm","cba","caret",
"plumber","httr")

It actually takes around 1 hour to install these!

Any suggestions on how to speed this up? (Or how to avoid reinstalling the packages on every new image build?)

Side note

I do not install these packages from the Dockerfile like this:

RUN Rscript -e "install.packages('stringr')"
...

Instead I create an R script Requirements.R which installs these packages and simply do:

RUN Rscript Requirements.R

Is this less optimal than installing the packages directly from the Dockerfile?

AnarKi

3 Answers

15

Use binary packages where you can, as we often do in the Rocker Project, which provides multiple Dockerfiles for R, including the official r-base one.

If you start from Ubuntu, you get Michael Rutter's PPAs with over 3000 packages; if you start from Debian, you get fewer from the distro but still many essential ones. (There are some efforts to bring more binary packages to Debian, but nothing is up right now.)
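As a rough illustration of the binary-package approach, a Dockerfile might look like the sketch below. Which of the packages from the question actually exist as `r-cran-*` binaries depends on the distribution and release, so the split used here (most packages from apt, `cba` from source) is an assumption, as is the choice of CRAN mirror.

```dockerfile
# Sketch only: assumes every r-cran-* name below is available in the
# repositories configured in the base image. Check with
# `apt-cache search r-cran-` inside the container before relying on it.
FROM r-base

# Install prebuilt binaries; apt resolves their R dependencies as well.
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        r-cran-jsonlite \
        r-cran-dplyr \
        r-cran-stringr \
        r-cran-tidyr \
        r-cran-lubridate \
        r-cran-knitr \
        r-cran-purrr \
        r-cran-httr && \
    rm -rf /var/lib/apt/lists/*

# Assumption: cba has no binary here, so it is compiled from CRAN.
RUN Rscript -e "install.packages('cba', repos = 'https://cloud.r-project.org')"
```

Only the packages without a binary pay the compilation cost, which is where most of the build time goes.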

Lastly, Dockerfile creation is of course compile time too. You spend the time once (per container creation) and reuse the result potentially many times afterwards. Also, by using the Docker Hub you can avoid spending your local CPU cycles.
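One way to read "spend the time once" is to build the slow package layer into its own image and start application images from it. The sketch below assumes a hypothetical image `example/r-deps:1.0` that was built once from a Dockerfile containing only the package installation and then pushed to a registry; the image name, tag, and `app.R` are placeholders, not names from this thread.

```dockerfile
# App image. Assumes the hypothetical base image example/r-deps:1.0
# already contains R plus all required packages.
FROM example/r-deps:1.0

WORKDIR /app

# Only the application code is copied here, so rebuilding after a code
# change does not touch the package installation at all.
COPY . /app

# app.R is a placeholder for the application's entry script.
CMD ["Rscript", "app.R"]
```

The base image only needs rebuilding when the package list changes.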

Edit in Sep 2020: The (updated) Ubuntu PPA now has over 4600 packages for the three most recent LTS releases. Still highly, highly recommended.

Dirk Eddelbuettel
  • You mentioned using binary packages in Rocker, but when I look at the rocker/tidyverse code I see statements like `install2.r tidyverse`, which seems to install from source on my machine. Is there a way to make these statements load from binaries? Thanks! – Jacqueline Nolis Sep 11 '18 at 19:02
  • Of course. For Ubuntu, point to the Rutter PPA and use it; I do that (outside of Docker) in my r-travis repo. I just committed a few Dockerfiles that do that, see for example [this one for Rcpp](https://github.com/RcppCore/Rcpp/blob/master/docker/Dockerfile) -- it installs multiple packages from Debian as `r-cran-*` binaries and two not-available-as-binary packages from CRAN. – Dirk Eddelbuettel Sep 11 '18 at 19:15
  • @DirkEddelbuettel unfortunately the link in the comment is dead. I think you meant this https://github.com/RcppCore/Rcpp/blob/master/docker/ci/Dockerfile and I add it for later reference. Thank you! – Richi W Jul 12 '19 at 19:20
  • Thanks for the correction. I also had a followup blog post in June that is relevant. No link as I am traveling and typing on a phone now.... – Dirk Eddelbuettel Jul 13 '19 at 04:49
  • Dear Dirk, does installing the Ubuntu binaries also install all the dependencies (e.g. caret has lots), or do you need to step through the dependency tree? – user2957945 Sep 22 '20 at 23:53
  • Distribution packages generally *do* have correct and full dependencies, so the answer here is a full YES! See a few recent [blog posts with examples](http://dirk.eddelbuettel.com/blog/code/r4/) where I show this. – Dirk Eddelbuettel Sep 23 '20 at 00:09
12

I found an article that described how to install R packages from precompiled binaries. It reduced the build time on our Jenkins server from 45 minutes down to 3 minutes.

Here is my Dockerfile:

FROM rocker/r-apt:bionic
WORKDIR /app
RUN apt-get update && \
  apt-get install -y libxml2-dev

# Install binaries (see https://datawookie.netlify.com/blog/2019/01/docker-images-for-r-r-base-versus-r-apt/)
COPY ./requirements-bin.txt .
RUN cat requirements-bin.txt | xargs apt-get install -y -qq

# Install remaining packages from source
COPY ./requirements-src.R .
RUN Rscript requirements-src.R

# Clean up package registry
RUN rm -rf /var/lib/apt/lists/*

COPY ./src /app

EXPOSE 5000
CMD ["Rscript", "Server.R"]

You can add a file requirements-bin.txt with package names:

r-cran-plumber
r-cran-quanteda
r-cran-irlba
r-cran-lsa
r-cran-caret
r-cran-stringr
r-cran-dplyr
r-cran-magrittr
r-cran-randomforest

And finally, a requirements-src.R for packages that are not available as binaries:

pkgs <- c(
    'otherpackage'
)

install.packages(pkgs)
Jodiug
  • This just saved us years of build time. Should be tagged as the solution for this question. – kummerer94 Feb 05 '20 at 14:22
  • That is exactly _the same answer_ as mine from a year earlier: Use binaries where you can. And you, just like me, point at Ubuntu and the add-on repos. The Rutter PPAs now have over 4600 prebuilt packages. – Dirk Eddelbuettel Sep 23 '20 at 00:10
  • Fair, I rephrased the first line of my answer. It was very similar to what you suggested. Not different, only more concrete. – Jodiug Sep 23 '20 at 09:02
  • Thank you for this, you've helped me significantly improve my Dockerfile build time! – Gabriel Pulga Mar 05 '21 at 19:35
  • The linked article appears unrelated and instead concerns Twitch videos. It seems the intended target has moved. – jwalton Mar 08 '21 at 11:09
4

I ended up using rocker/r-base as @DirkEddelbuettel suggested. Also, thanks to the question "How to avoid reinstalling packages when building Docker image for Python projects?", I wrote my Dockerfile in a way that doesn't reinstall packages every time I rebuild my Docker image.

I want to share what my Dockerfile looks like now; hopefully this will be of help to others:

FROM rocker/r-base

# install system libraries (update and install in the same layer, so a cached
# package index is never reused for a later install)
RUN apt-get update && \
    apt-get -y install libcurl4-openssl-dev libssl-dev

# set work directory 
WORKDIR /myapp

# copy requirements R script first, so this layer and the install below stay
# cached until Requirements.R itself changes
COPY ./Requirements.R /myapp/Requirements.R

# run requirements R script
RUN Rscript Requirements.R

COPY . /myapp

EXPOSE 8094

ENV NAME=R-test-service

CMD ["Rscript", "my_R_api.R"]
AnarKi