I am installing several R packages from CRAN via a Dockerfile. Below is my Dockerfile:

FROM r-base:4.0.2
RUN apt-get update \
      && apt-get install -y --auto-remove \
      build-essential \
      libcurl4-openssl-dev \
      libpq-dev \
      libssl-dev \
      libxml2-dev \
      && R -e "system.time(install.packages(c('shiny', 'rmarkdown', 'Hmisc', 'rjson', 'caret','DBI', 'RPostgres','curl', 'httr', 'xml2', 'aws.s3'), repos='https://cloud.r-project.org/'))"
RUN mkdir /shinyapp
COPY . /shinyapp    
EXPOSE 5000
CMD ["R", "-e", "shiny::runApp('/shinyapp/src/shiny', port = 5000, host = '0.0.0.0')"]

The Docker build takes too long (25 to 30 minutes). Below are the timings reported by system.time(), in seconds, after the build completed:

user   system  elapsed 
1306.268  232.438 1361.374 

Is there any way to optimize the above Dockerfile? Is there a way to install packages in parallel?

Note: I have also tried rocker/r-base, but it did not improve installation speed.

Prasad Deshmukh
  • Maybe useful: https://stackoverflow.com/questions/51500385/how-to-speed-up-r-packages-installation-in-docker and this blog post, with the links therein: http://dirk.eddelbuettel.com/blog/2020/08/26/#029_introducing_bspm – user20650 Dec 09 '20 at 12:15
  • if you install your packages using a separate command for each one, then installation progress gets cached and next time around you start at the line where changes began. You do not have to install all packages again that way. – janderkran Dec 09 '20 at 15:50

1 Answer

‘pak’ performs package download and installation in parallel.

Unfortunately, the current CRAN version of ‘pak’ (0.1.2.1) is arguably broken: it has tons of dependencies. By contrast, the development version on GitHub has no external dependencies, as it should, so that is the one we need to install.

So you could change your Dockerfile as follows:

…
 && Rscript -e "install.packages('pak', repos = 'https://r-lib.github.io/p/pak/dev/'); pak::pkg_install(c('shiny', 'rmarkdown', 'Hmisc', 'rjson', 'caret','DBI', 'RPostgres','curl', 'httr', 'xml2', 'aws.s3'))"
…

But, frankly, that’s quite unreadable. A better approach would be to use ARG or ENV to supply the packages to be installed (this applies regardless of whether we use ‘pak’ to install packages):

FROM r-base:4.0.2

ARG PKGS="shiny, rmarkdown, Hmisc, rjson, caret, DBI, RPostgres, curl, httr, xml2, aws.s3"

RUN apt-get update \
 && apt-get install -y --auto-remove \
    build-essential \
    libcurl4-openssl-dev \
    libpq-dev \
    libssl-dev \
    libxml2-dev

RUN Rscript -e 'install.packages("pak", repos = "https://r-lib.github.io/p/pak/dev/")' \
 && echo "$PKGS" \
  | Rscript -e 'pak::pkg_install(strsplit(readLines("stdin"), ", ?")[[1L]])'

RUN mkdir /shinyapp
COPY . /shinyapp

EXPOSE 5000
CMD ["Rscript", "-e", "shiny::runApp('/shinyapp/src/shiny', port = 5000, host = '0.0.0.0')"]
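
To see what the `strsplit(...)` step in the `RUN` line above hands to `pak::pkg_install()`, here is a minimal sketch that applies the same split to a literal string instead of standard input. The variable names `pkgs_arg` and `pkgs` are made up for illustration and do not appear in the Dockerfile.

```r
# Illustration only: same split as in the Dockerfile, but on a literal
# string rather than readLines("stdin"), so it runs in any R session.
# `pkgs_arg` and `pkgs` are hypothetical names chosen for this sketch.
pkgs_arg <- "shiny, rmarkdown, Hmisc, rjson, caret, DBI, RPostgres, curl, httr, xml2, aws.s3"

# strsplit() returns a list with one element per input string;
# [[1L]] extracts the character vector for our single string.
# The regular expression ", ?" matches a comma plus an optional space.
pkgs <- strsplit(pkgs_arg, ", ?")[[1L]]

# Show the resulting character vector of package names.
print(pkgs)
```

Because the separator pattern allows an optional space, the list can be written as either "a, b" or "a,b" in the ARG value.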

Also note that R shouldn’t be invoked via the R binary for scripted use; that’s what Rscript is for. Amongst other things, it handles stdout better.

Konrad Rudolph
  • Yes, the CRAN version of pak is broken and cannot install dependencies properly. The dev version on GitHub worked amazingly for me. The build took 10 minutes less to complete. Thank you! – Prasad Deshmukh Dec 09 '20 at 19:00