I have a docker compose file that contains the below volume mapping.

volumes:
    - /opt/cloudera/parcels/SPARK2/lib/spark2:/opt/cloudera/parcels/SPARK2/lib/spark2

The contents of this directory are:

drwxr-xr-x 13 root root   247 Nov 30 16:39 .
drwxr-xr-x  3 root root    20 Jan  9  2018 ..
drwxr-xr-x  2 root root  4096 Jan  9  2018 bin
drwxr-xr-x  2 root root    39 Jan  9  2018 cloudera
lrwxrwxrwx  1 root root    16 Jan  9  2018 conf -> /etc/spark2/conf ***
drwxr-xr-x  5 root root    50 Jan  9  2018 data
drwxr-xr-x  4 root root    29 Jan  9  2018 examples
drwxr-xr-x  2 root root  8192 May 22  2018 jars
drwxr-xr-x  2 root root   204 Jan  9  2018 kafka-0.10
drwxr-xr-x  2 root root   201 Jan  9  2018 kafka-0.9
-rw-r--r--  1 root root 17881 Jan  9  2018 LICENSE
drwxr-xr-x  2 root root  4096 Jan  9  2018 licenses
-rw-r--r--  1 root root 24645 Jan  9  2018 NOTICE
drwxr-xr-x  6 root root   204 Jan  9  2018 python
-rw-r--r--  1 root root  3809 Jan  9  2018 README.md
-rw-r--r--  1 root root   313 Jan  9  2018 RELEASE
drwxr-xr-x  2 root root  4096 Jan  9  2018 sbin
lrwxrwxrwx  1 root root    20 Jan  9  2018 work -> /var/run/spark2/work
drwxr-xr-x  2 root root    52 Jan  9  2018 yarn

Of note is the starred conf directory, which is itself a series of symbolic links that eventually point to the /etc/spark2/conf.cloudera.spark2_on_yarn folder, which contains:

drwxr-xr-x 3 root  root    194 Nov 30 16:39 .
drwxr-xr-x 3 root  root     54 Nov 12 14:45 ..
-rw-r--r-- 1 root  root  13105 Sep 16 03:07 classpath.txt
-rw-r--r-- 1 root  root     20 Sep 16 03:07 __cloudera_generation__
-rw-r--r-- 1 root  root    148 Sep 16 03:07 __cloudera_metadata__
-rw-r--r-- 1 ember 10000  2060 Nov 30 16:33 envars.test
-rw-r--r-- 1 root  root    951 Sep 16 03:07 log4j.properties
-rw-r--r-- 1 root  root   1837 Sep 16 03:07 spark-defaults.conf
-rw-r--r-- 1 root  root   2331 Sep 16 03:07 spark-env.sh
drwxr-xr-x 2 root  root    242 Sep 16 03:07 yarn-conf
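
(For completeness, the chain can be traced on the host with readlink -f, or with namei where util-linux is installed; the exact intermediate links depend on the particular Cloudera install, so treat the commands below as a sketch.)

# resolve the conf symlink chain to its final target on the host
readlink -f /opt/cloudera/parcels/SPARK2/lib/spark2/conf
# list every path component, flagging intermediate symlinks
namei -l /opt/cloudera/parcels/SPARK2/lib/spark2/conf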

When mapping the spark2 directory, only the yarn-conf subfolder shows up inside the container; the spark-env.sh file and the other files are absent.

Is the series of symbolic links causing these files to be absent? If so, do I need to explicitly set a mapping for every single folder in order to get all of the necessary dependencies to appear? I was under the impression that docker-compose volumes would recursively mount all files/folders under a given directory.
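
As a sanity check, something like the following (spark being a placeholder for the actual service name in the compose file) would show what the container actually sees:

# list the mounted directory and its conf entry from inside the container
docker-compose exec spark ls -la /opt/cloudera/parcels/SPARK2/lib/spark2
# if conf is a dangling symlink, listing through it will fail
docker-compose exec spark ls -la /opt/cloudera/parcels/SPARK2/lib/spark2/conf/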

mongolol

2 Answers


The bind mount should faithfully reproduce the contents of the host directory: conf inside the container should still be a symbolic link to /etc/spark2/conf. The container may or may not have anything at that path, but Docker doesn't recursively search the bind-mounted tree or try to do anything special with symlinks.
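
That means that, if the host's Spark tree really must be reused as-is, each symlink target path also has to be bind-mounted into the container. A rough compose sketch follows; the /etc/alternatives line is an assumption about how the Cloudera conf chain is wired (and mounting it wholesale shadows whatever the image already has there), while /var/run/spark2/work covers the work link from the listing above:

volumes:
    - /opt/cloudera/parcels/SPARK2/lib/spark2:/opt/cloudera/parcels/SPARK2/lib/spark2
    - /etc/spark2:/etc/spark2
    - /etc/alternatives:/etc/alternatives:ro
    - /var/run/spark2/work:/var/run/spark2/work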

Are you trying to use docker run -v to "install" a Spark distribution in your container? You might be better off building a standalone Docker image with the software you want, and then using a bind mount to only inject the config files. That could look something like

docker run \
  -v /etc/spark2/conf:/spark/conf \
  -v $PWD/spark:/spark/work \
  mysparkimage
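
In docker-compose terms the same idea would look roughly like this (the image name and the spark service name are placeholders carried over from the example above):

version: "3"
services:
  spark:
    image: mysparkimage
    volumes:
      - /etc/spark2/conf:/spark/conf
      - ./spark:/spark/work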
David Maze
  • Well, the end goal is to have the Spark binaries/confs mounted within the container so that from inside the container we can submit jobs to the cluster. Changes in the hadoop/yarn/spark configs on the host need to be reflected in the container. The issue with shipping a container with a version of Spark that matches the one on the host, and then just mounting the confs, is that upgrades to the Spark version may not be reflected. However, as this is a rare case, that may be acceptable. Exploding out every symlink seems to be the other alternative. – mongolol Dec 04 '18 at 14:56
  • If you want the exact same software as on the host, and the exact same configuration as on the host...what are you getting from Docker? – David Maze Dec 04 '18 at 15:44
  • We have an application which runs inside of the container and orchestrates Spark jobs. The main purpose of Docker is deploying this application, which, if the deployment is on a gateway node of a cluster, needs to be cognizant of the greater Spark ecosystem. The goal was to just use the host's version of Spark so that we wouldn't have to ship it (by mounting both the confs and binaries). However, it may be worth it just to provide builds that match the host's Spark version, and then just mount the confs. – mongolol Dec 04 '18 at 20:01

Possible duplicate of this question. In short, symlinks don't work very well inside Docker containers.
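
A minimal way to reproduce the behaviour, assuming a small image such as busybox is available (the /tmp/demo path is just a throwaway example):

# create a symlink on the host whose target won't exist inside the container
mkdir -p /tmp/demo
ln -s /etc/spark2/conf /tmp/demo/conf
# the symlink itself is reproduced inside the container...
docker run --rm -v /tmp/demo:/demo busybox ls -la /demo
# ...but following it fails, because /etc/spark2/conf doesn't exist in busybox
docker run --rm -v /tmp/demo:/demo busybox ls /demo/conf/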

pew007