
We have a Spark cluster built with Docker (the singularities/spark image). When we remove the containers, the data stored in HDFS is removed as well. I know this is expected, but how can I solve it so that whenever I start the cluster again, the files in HDFS are restored without having to upload them again?

ugur
  • Possible duplicate of [I lose my data when the container exits](https://stackoverflow.com/questions/19585028/i-lose-my-data-when-the-container-exits) – David Maze Aug 15 '18 at 14:02

1 Answer


You can bind-mount a host directory onto the /opt/hdfs directory for both the master and the worker, as below:

version: "2"

services:
  master:
    image: singularities/spark
    command: start-spark master
    hostname: master
    volumes:
      - "${PWD}/hdfs:/opt/hdfs"
    ports:
      - "6066:6066"
      - "7070:7070"
      - "8080:8080"
      - "50070:50070"
  worker:
    image: singularities/spark
    command: start-spark worker master
    volumes:
      - "${PWD}/hdfs:/opt/hdfs"
    environment:
      SPARK_WORKER_CORES: 1
      SPARK_WORKER_MEMORY: 2g
    links:
      - master

This way your HDFS files will always persist in ./hdfs (an hdfs directory inside the current working directory) on the host machine.
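To check that the data actually survives a restart, you can put a small test file into HDFS, recreate the containers, and list the directory again. This is only a sketch; it assumes the hdfs client is on the PATH inside the master container, which is typical for Hadoop-based images:

    # start the cluster and write a test file into HDFS
    docker-compose up -d
    docker-compose exec master hdfs dfs -put /etc/hosts /persist-test.txt

    # remove the containers (the ./hdfs bind mount on the host is kept)
    docker-compose down

    # recreate the containers and verify the file is still there
    docker-compose up -d
    docker-compose exec master hdfs dfs -ls /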

Ref - https://hub.docker.com/r/singularities/spark/
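If you would rather let Docker manage the storage location instead of using a host path, a named volume should work as well. A minimal sketch (the volume name hdfs-data is just an example; ports, environment, and links stay as in the compose file above):

    version: "2"

    services:
      master:
        image: singularities/spark
        command: start-spark master
        hostname: master
        volumes:
          # named volume managed by Docker instead of a host directory
          - hdfs-data:/opt/hdfs
      worker:
        image: singularities/spark
        command: start-spark worker master
        volumes:
          - hdfs-data:/opt/hdfs
        links:
          - master

    volumes:
      hdfs-data:

The data then lives in the Docker-managed volume and survives docker-compose down, unless you explicitly remove volumes with the -v/--volumes flag.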

vivekyad4v