My apologies, as I think this may be a simple question, but it is something I am really struggling to understand.

As background, I am trying to create a Dockerfile that installs a lot of R CRAN and Bioconductor packages, as well as some R packages from GitHub. I want to do this as quickly as possible, so I'm using rocker's base image to install binary packages; see here for a great, quick tutorial: https://datawookie.dev/blog/2019/01/docker-images-for-r-r-base-versus-r-apt/

My approach is first to install all the packages I need as binaries and, where a binary is not available, install the package from source. After this, I use the Bioconductor base image to install the necessary Bioconductor packages.

However, the packages I installed through the rocker base image aren't available after I import the Bioconductor base image. This is where I feel I don't have a clear understanding of how Dockerfiles work, and I can't find an answer in any documentation. Is there some way to copy these packages over after importing another image? I didn't think this would be necessary, as I have seen others use the same approach, such as the question poster here: Minimizing the size of docker image R shiny app

Note that I import the Bioconductor base image because I thought it would help with dependency issues. I could install the Bioconductor packages the same way as the R packages that aren't available as binaries, but I want to do this as quickly and cleanly as possible, and I thought that would slow things down.

Essentially, I want to know the quickest way to install R binaries, R non-binaries, Bioconductor packages and GitHub packages all in one Dockerfile.

An example of my approach is below, with a very small subset of the packages I need. I have shown my full approach to installing R binaries, R non-binaries, Bioconductor and GitHub packages, but to see the issue, look at what happens to the tidyverse R package before and after I import the Bioconductor image: the call library(tidyverse) runs before but fails after.

Dockerfile

## Use r-ubuntu, prev r-apt:bionic to enable the use of binary r packages for speed for R 4.0
FROM rocker/r-ubuntu:18.04

## Install available binaries - for speed
RUN apt-get update && \
    apt-get install -y -qq \
r-cran-tidyverse \
r-cran-ids \
r-cran-snow

## Install remaining packages from source
COPY ./requirements-src.R .
RUN Rscript requirements-src.R

## This works
RUN R -e 'library(tidyverse)'

## Install Bioconductor packages
# Docker inheritance
FROM bioconductor/bioconductor_docker:RELEASE_3_12
COPY ./requirements-bioc.R .
#Don't bother running for speed but this will run
#RUN R -e 'BiocManager::install(ask = F)' && Rscript requirements-bioc.R

#This will fail - can't find the package
RUN R -e 'library(tidyverse)'

## Install from GH the following
#Don't bother running for speed but this will run
#RUN installGithub.r mojaveazure/loomR 


EXPOSE 8787

## Make R the default
CMD ["R"]

requirements-src.R

pkgs <- c(
'spelling',
'english',
'DT'
)

install.packages(pkgs)

requirements-bioc.R

bioc_pkgs<-c(
'biomaRt',
'DropletUtils',
'rhdf5'
)

BiocManager::install(bioc_pkgs, ask = FALSE)
A_Murphy
  • While I agree that one shouldn't waste time, is there a reason you really need to reduce the image-creation time? Unless you're doing it daily, I would expect that getting it to work well and consistently would win out over trying to be frugal on package compilation time. – r2evans May 13 '21 at 13:04
  • Side question, though, are the library paths in the first (`rocker`) container masked by directories in the second (`Bioconductor`)? – r2evans May 13 '21 at 13:04
  • Image creation time is a problem as the previous image was taking hours to build and docker hub has a 2 hour build limit so it had to be built manually each time (the real use case has a lot of packages to install) - this is what I want to avoid! On the side question, I actually don't know. – A_Murphy May 13 '21 at 13:23
  • Ok, that makes sense now (I don't build on docker hub so have never run into a build-time limit). For the second, try explicitly creating a package library directory elsewhere, install into *it*, then see if it is still available after you overlay the Bioconductor base image. – r2evans May 13 '21 at 13:24
  • This is where my lack of an extensive knowledge on Docker gets in the way, I understand that approach in principle and why it should work but I haven't done something like that before. You may not know but would that approach still be useable with a github action to build the docker on push? – A_Murphy May 13 '21 at 13:59
  • No idea, sorry. – r2evans May 13 '21 at 15:50
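
The suggestion in the comments (install into a dedicated library directory, then check whether it survives the switch to the Bioconductor image) can be sketched as a multi-stage build. Each FROM instruction starts a new build stage, and the final image contains only what the last stage holds plus anything brought in with COPY --from, which is why tidyverse disappears in the question's Dockerfile. The following is a minimal sketch, not a tested solution: the stage name `cran` and the path `/opt/r-libs` are hypothetical choices for illustration, and packages compiled in one image may not load in another if the R version or system libraries differ. Packages installed with apt-get (the r-cran-* binaries) live in the system library paths and are not covered by this sketch.

```dockerfile
# Stage 1: install CRAN packages into a dedicated library directory.
# "cran" and /opt/r-libs are hypothetical names chosen for this sketch.
FROM rocker/r-ubuntu:18.04 AS cran
RUN mkdir -p /opt/r-libs \
 && Rscript -e 'install.packages(c("spelling", "english", "DT"), lib = "/opt/r-libs")'

# Stage 2: a new FROM discards everything above unless it is copied explicitly.
FROM bioconductor/bioconductor_docker:RELEASE_3_12
COPY --from=cran /opt/r-libs /opt/r-libs

# R_LIBS puts the copied directory on R's library search path
# (it also becomes the first entry in .libPaths()).
ENV R_LIBS=/opt/r-libs

# Fails the build if the copied package cannot be loaded in this image.
RUN Rscript -e 'library(DT)'
```

Whether this is faster than installing everything in a single stage depends on the packages involved; the answer below takes the single-stage route instead.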

1 Answer

In the interest of anyone else facing a similar problem, I will post my solution. I am not suggesting that this is the only solution, so if others find better alternatives, I'll update this answer.

In the end, my approach to creating a Docker image that installs a lot of R CRAN and Bioconductor packages, as well as some R packages from GitHub, was:

  1. Use the latest Rocker RStudio image - to get packages installed as binary and to also enable easy debugging of your package with the correct dependencies since you can interactively run your image
  2. Install all the system libraries used by the latest Bioconductor image - to ensure you can install any Bioconductor package without issue
  3. Install CRAN binaries
  4. Install CRAN packages from source - where binaries aren't available
  5. Install Bioconductor packages
  6. Install Github packages

My solution uses these steps in this order and should prove to be a fast and efficient solution. My use case was an R package that required more than 80 other packages from CRAN, Bioconductor and GitHub as dependencies, and this solution reduced the build time to a fraction of the original. Also, since it uses the latest version of Rocker RStudio and of the packages, it should stay up to date with the latest versions of software and packages.

The Dockerfile looks like this:

#LABEL maintainer="John Doe"

## Use rocker/rstudio, which installs binaries from RStudio's RSPM service by default
## and uses the latest stable Ubuntu, R and Bioconductor versions
FROM rocker/rstudio


## Add packages dependencies - from Bioconductor
RUN apt-get update \
        && apt-get install -y --no-install-recommends apt-utils \
        && apt-get install -y --no-install-recommends \
        ## Basic deps
        gdb \
        libxml2-dev \
        python3-pip \
        libz-dev \
        liblzma-dev \
        libbz2-dev \
        libpng-dev \
        libgit2-dev \
        ## sys deps from bioc_full
        pkg-config \
        fortran77-compiler \
        byacc \
        automake \
        curl \
        ## This section installs libraries
        libpcre2-dev \
        libnetcdf-dev \
        libhdf5-serial-dev \
        libfftw3-dev \
        libopenbabel-dev \
        libopenmpi-dev \
        libxt-dev \
        libudunits2-dev \
        libgeos-dev \
        libproj-dev \
        libcairo2-dev \
        libtiff5-dev \
        libreadline-dev \
        libgsl0-dev \
        libgslcblas0 \
        libgtk2.0-dev \
        libgl1-mesa-dev \
        libglu1-mesa-dev \
        libgmp3-dev \
        libhdf5-dev \
        libncurses-dev \
        libbz2-dev \
        libxpm-dev \
        liblapack-dev \
        libv8-dev \
        libgtkmm-2.4-dev \
        libmpfr-dev \
        libmodule-build-perl \
        libapparmor-dev \
        libprotoc-dev \
        librdf0-dev \
        libmagick++-dev \
        libsasl2-dev \
        libpoppler-cpp-dev \
        libprotobuf-dev \
        libpq-dev \
        libperl-dev \
        ## software - perl extensions and modules
        libarchive-extract-perl \
        libfile-copy-recursive-perl \
        libcgi-pm-perl \
        libdbi-perl \
        libdbd-mysql-perl \
        libxml-simple-perl \
        libmysqlclient-dev \
        default-libmysqlclient-dev \
        libgdal-dev \
        ## new libs
        libglpk-dev \
        ## Databases and other software
        sqlite \
        openmpi-bin \
        mpi-default-bin \
        openmpi-common \
        openmpi-doc \
        tcl8.6-dev \
        tk-dev \
        default-jdk \
        imagemagick \
        tabix \
        ggobi \
        graphviz \
        protobuf-compiler \
        jags \
        ## Additional resources
        xfonts-100dpi \
        xfonts-75dpi \
        biber \
        libsbml5-dev \
        && apt-get clean \
        && rm -rf /var/lib/apt/lists/*

#install R CRAN binary packages
RUN install2.r -e \
testthat

## Install remaining packages from source
COPY ./requirements-src.R .
RUN Rscript requirements-src.R

## Install Bioconductor packages
COPY ./requirements-bioc.R .
RUN apt-get update \
 && apt-get install -y --no-install-recommends \
   libfftw3-dev \
   gcc && apt-get clean \
 && rm -rf /var/lib/apt/lists/*
RUN Rscript -e 'requireNamespace("BiocManager"); BiocManager::install(ask = FALSE)' \
&& Rscript requirements-bioc.R

## Install from GH the following
RUN installGithub.r theislab/kBET \
chris-mcginnis-ucsf/DoubletFinder

Note that the CRAN packages from source and the Bioconductor packages are held in separate scripts in the same folder as your Dockerfile.
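
One caveat with script-based installs: install.packages() reports a failed installation as a warning rather than an error, so a step like RUN Rscript requirements-src.R can succeed even when a package did not build. A hedged sketch of a check that could be added after the install steps, in the same spirit as the library(tidyverse) test in the question, is shown below. The package names are taken from the two requirements scripts; the placement after the install steps is an assumption.

```dockerfile
# Fail the build early if an expected package is missing.
# library() raises an error for a missing package, so Rscript exits non-zero
# and the docker build stops at this step.
RUN Rscript -e 'for (p in c("spelling", "english", "Seurat")) library(p, character.only = TRUE)'
RUN Rscript -e 'for (p in c("biomaRt", "SingleCellExperiment", "SummarizedExperiment")) library(p, character.only = TRUE)'
```

Keeping the checks in separate RUN steps means a failure points at the specific group of packages (CRAN from source, or Bioconductor) that went wrong.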

requirements-src.R:

pkgs <- c(
'spelling',
'english',
'Seurat')

install.packages(pkgs)

requirements-bioc.R:

bioc_pkgs<-c(
'biomaRt',
'SingleCellExperiment',
'SummarizedExperiment')

requireNamespace("BiocManager")
BiocManager::install(bioc_pkgs, ask = FALSE)
A_Murphy