TL;DR
Spark 1.6.1 fails to write a CSV file using Spark CSV 1.4 on a standalone cluster with no HDFS, throwing IOException: Mkdirs failed to create file.
More details:
I'm working on a Spark 1.6.1 application in Scala, running on a standalone cluster that uses the local filesystem (the machine I'm running on doesn't even have HDFS installed). I have a DataFrame, obtained through a HiveContext, that I'm trying to save as a CSV file.
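For context, exportData is built from a HiveContext along these lines (a sketch only; the table name and query are placeholders, and sc is the existing SparkContext):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// placeholder query - the real exportData comes from my own tables
val exportData = hiveContext.sql("SELECT * FROM some_table")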
This is what I'm running:
exportData.write
.mode(SaveMode.Overwrite)
.format("com.databricks.spark.csv")
.option("delimiter", ",")
.save("/some/path/here") // no hdfs:/ or file:/ prefix in the path
The Spark CSV version I'm using is 1.4. When running this code I get the following exception:
WARN TaskSetManager:70 - Lost task 4.3 in stage 10.0: java.io.IOException: Mkdirs failed to create file: /some/path/here/_temporary/0
The full stacktrace is:
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:442)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:428)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801)
at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1193)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1185)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
The output dir does get created, but it's empty.
I tried running it from the spark shell: I created a dummy DataFrame and saved it using the exact same code (and to the same path). It succeeded.
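For reference, the shell test was roughly this (a sketch; the dummy data is made up, but the save call is identical to the one above):

import sqlContext.implicits._
import org.apache.spark.sql.SaveMode

// dummy DataFrame built from a Seq of tuples, just to exercise the same write path
val dummy = Seq((1, "a"), (2, "b")).toDF("id", "value")

dummy.write
  .mode(SaveMode.Overwrite)
  .format("com.databricks.spark.csv")
  .option("delimiter", ",")
  .save("/some/path/here") // same path as in the failing job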
I checked the permissions on the folder I'm writing to and changed them to 777, but it still doesn't work when running the Spark job.
Googling it suggested:
- changing the path prefix by removing hdfs:/, which I don't have anyway. I also tried adding a file:/, file://, or file:/// prefix, with no luck (see the sketch after this list for the exact variants I tried)
- permissions issues - I tried solving this by making the folder 777
- some MacBook issue, which is probably not relevant to me since I'm working on Ubuntu
- security issues - examining my stacktrace, I couldn't find any security failure.
- removing the / prefix at the beginning of my file path - I tried that as well, with no luck
- other unanswered questions regarding this problem
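To be explicit about the path variants, this is roughly what I tried (same write configuration as above; /some/path/here stands in for the real path):

exportData.write
  .mode(SaveMode.Overwrite)
  .format("com.databricks.spark.csv")
  .option("delimiter", ",")
  .save("file:///some/path/here") // also tried file:/..., file://..., the bare /some/path/here, and some/path/here without the leading /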
Does anyone have any idea what exactly the problem is, and how to overcome it?
Thanks in advance