
I am running this command line:

hadoop fs -rm -r /tmp/output

And then a Java 8 Spark job with this main():

    SparkConf sparkConf = new SparkConf();
    JavaSparkContext sc = new JavaSparkContext(sparkConf);
    JavaRDD<JSONObject> rdd = sc.textFile("/tmp/input")
            .map(s -> new JSONObject(s));
    rdd.saveAsTextFile("/tmp/output");
    sc.stop();

And I get this error:

ERROR ApplicationMaster: User class threw exception: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /tmp/output already exists

Any idea how to fix it?

Uri Goren
  • I have used the following command in the SparkConf and it works perfectly well `yourSparkConf.set("spark.hadoop.validateOutputSpecs", "false")` – ypriverol Oct 25 '17 at 09:37

1 Answer


You removed the directory from HDFS, but Spark is trying to save to the local file system. A relative or bare path like `/tmp/output` is resolved against the default file system (`fs.defaultFS`), which in your setup appears to be the local one.

To save to HDFS explicitly, try this:

rdd.saveAsTextFile("hdfs://<URL-hdfs>:<PORT-hdfs>/tmp/output");

The default for localhost is:

rdd.saveAsTextFile("hdfs://localhost:9000/tmp/output");

Another solution is to remove /tmp/output from your local file system before running the job.
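A third option (a sketch, assuming a standard Hadoop setup) is to delete the output path from inside the job itself, using the Hadoop `FileSystem` API resolved against the same configuration Spark uses. That way the path is removed from whichever file system Spark actually targets, local or HDFS:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// ... inside main(), after creating the JavaSparkContext `sc`:

// Reuse Spark's Hadoop configuration so the path resolves
// against the same fs.defaultFS that saveAsTextFile will use.
Configuration conf = sc.hadoopConfiguration();
FileSystem fs = FileSystem.get(conf);

Path out = new Path("/tmp/output");
if (fs.exists(out)) {
    fs.delete(out, true); // true = recursive delete
}

rdd.saveAsTextFile(out.toString());
```

This avoids the mismatch between `hadoop fs -rm` and the file system the job writes to, since both the check and the save go through the same configuration.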

Best regards

avr
DanielVL