
I'm using the spark-worker container, which is based on the spark-base container.

How can I solve the exception:

org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/README.md

Main.java

        context = new SparkContext(
                new SparkConf()
                        .setAppName("Test App")
                        .setMaster("spark://spark-master:7077")
                        .set("spark.executor.memory", "1g")
                        .setJars(new String[] { "target/spark-docker-1.0-SNAPSHOT.jar" })
        );

        String path = "file:///README.md";

        // EXCEPTION HERE!!!
        List<Tuple2<String, Integer>> output = context.textFile(path, 2) 
         ...

My Docker containers do not set up HDFS, so I hope Spark will work with the local file system of each spark-worker. On each worker I did:

shell> docker exec -it spark-worker-# bash
shell> touch README.md

docker-compose.yml

# No HDFS or file system configurations!

version: '3.3'
services:
  spark-master:
    image: bde2020/spark-master
    container_name: spark-master
    ports: ['8080:8080', '7077:7077', '6066:6066']
  spark-worker-1:
    image: bde2020/spark-worker
    container_name: spark-worker-1
    ports: ['8082:8081']
    depends_on:
      - spark-master
    environment:
      - "SPARK_MASTER=spark://spark-master:7077"
  spark-worker-2:
    image: bde2020/spark-worker
    container_name: spark-worker-2
    ports: ['8083:8081']
    depends_on:
      - spark-master
    environment:
      - "SPARK_MASTER=spark://spark-master:7077"

1 Answer


Spark can work with local files, but it means you have to provide a copy of the file on every node in the cluster (including the machine that runs the driver).

Additionally "file:///README.md" is a path in the root directory of the file system so make sure this is where you create the file, and that the user has correct access rights.

The easiest way to use local files is to distribute them with SparkContext.addFile and read them back with SparkFiles.
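
A rough sketch (untested), assuming context is the SparkContext from your Main.java and that the path passed to addFile exists on the driver machine (the "/path/on/driver/README.md" below is a placeholder):

        // Imports: org.apache.spark.SparkFiles, org.apache.spark.api.java.JavaSparkContext
        JavaSparkContext jsc = new JavaSparkContext(context);

        // Ship the driver's copy of the file to every node that runs tasks.
        jsc.addFile("/path/on/driver/README.md");

        // Read the node-local copy inside a task; SparkFiles.get resolves the
        // path on whichever node executes the lambda.
        List<Integer> lineCounts = jsc.parallelize(java.util.Arrays.asList(1), 1)
                .map(x -> java.nio.file.Files
                        .readAllLines(java.nio.file.Paths.get(SparkFiles.get("README.md")))
                        .size())
                .collect();

Note that this reads the file with plain Java IO inside a task rather than with textFile, because the absolute path returned by SparkFiles.get differs between nodes.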

Also remember that correct writes require distributed storage - see Saving dataframe to local file system results in empty results.

If you want to support both writes and reads, just use a Docker volume shared between the master and the workers.
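
Something along these lines (untested sketch; the spark-data name and the /data mount point are placeholders):

# Additions to the existing docker-compose.yml: a named volume mounted
# at the same path in every service.
services:
  spark-master:
    volumes:
      - spark-data:/data
  spark-worker-1:
    volumes:
      - spark-data:/data
  spark-worker-2:
    volumes:
      - spark-data:/data
volumes:
  spark-data:

Then point the application at file:///data/README.md. Keep in mind a named volume like this is only shared between containers running on the same Docker host.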