We have a Spark cluster built with Docker (the singularities/spark image). When we remove the containers, the data stored in HDFS is removed as well. I know this is expected, but how can I persist the HDFS files so that they are restored whenever I start the cluster again, without having to upload them a second time?
Possible duplicate of [I lose my data when the container exits](https://stackoverflow.com/questions/19585028/i-lose-my-data-when-the-container-exits) – David Maze Aug 15 '18 at 14:02
1 Answer
You can bind-mount a host directory onto `/opt/hdfs` for both the master and the worker:
version: "2"
services:
master:
image: singularities/spark
command: start-spark master
hostname: master
volumes:
- "${PWD}/hdfs:/opt/hdfs"
ports:
- "6066:6066"
- "7070:7070"
- "8080:8080"
- "50070:50070"
worker:
image: singularities/spark
command: start-spark worker master
volumes:
- "${PWD}/hdfs:/opt/hdfs"
environment:
SPARK_WORKER_CORES: 1
SPARK_WORKER_MEMORY: 2g
links:
- master
This way your HDFS files will always persist at `./hdfs` (an `hdfs` directory in the current working directory) on the host machine.
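To check that the data actually survives a restart, a minimal sketch like the following should work, assuming the compose file above is saved as `docker-compose.yml` in the current directory and the `hdfs` client is on the PATH inside the containers (as it is in the singularities/spark image):

```sh
# Start the cluster; ./hdfs is created on the host if it does not exist yet.
docker-compose up -d

# Put a sample file into HDFS from inside the master container.
docker-compose exec master hdfs dfs -mkdir -p /data
docker-compose exec master hdfs dfs -put /etc/hosts /data/hosts

# Tear the cluster down; the containers are removed, but ./hdfs on the host remains.
docker-compose down

# Bring the cluster back up: the file is still listed, no re-upload needed.
docker-compose up -d
docker-compose exec master hdfs dfs -ls /data
```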

vivekyad4v