I copy a tree of files from S3 to HDFS with S3DistCp in an initial EMR step (the copy command is sketched after the listing below). Running hdfs dfs -ls -R hdfs:///data_dir shows the expected files, which look something like:
/data_dir/year=2015/
/data_dir/year=2015/month=01/
/data_dir/year=2015/month=01/day=01/
/data_dir/year=2015/month=01/day=01/data01.12345678
/data_dir/year=2015/month=01/day=01/data02.12345678
/data_dir/year=2015/month=01/day=01/data03.12345678
The 'directories' are listed as zero-byte files.
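For reference, the copy step is roughly the following (the source bucket and path are placeholders, not my real ones):

s3-dist-cp --src s3://my-bucket/data_dir --dest hdfs:///data_dir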
I then run a Spark step which needs to read these files. The loading code is:
sqlctx.read.json('hdfs:///data_dir', schema=schema)
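For completeness, sqlctx and schema come from setup roughly like this (the schema fields shown are placeholders; the real schema matches the JSON records):

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType

sc = SparkContext(appName='load_json')
sqlctx = SQLContext(sc)

# placeholder fields -- the real schema describes the actual JSON records
schema = StructType([
    StructField('id', StringType(), True),
    StructField('payload', StringType(), True),
])

# this is the call that fails
df = sqlctx.read.json('hdfs:///data_dir', schema=schema)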
The job fails with a Java exception:
java.io.IOException: Not a file: hdfs://10.159.123.38:9000/data_dir/year=2015
I had (perhaps naively) assumed that Spark would recursively descend the directory tree and load the data files. If I point the same code at S3 instead, it loads the data successfully.
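That is, roughly this works fine (the bucket name is a placeholder):

df = sqlctx.read.json('s3://my-bucket/data_dir', schema=schema)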
Am I misunderstanding HDFS? Can I tell Spark to ignore the zero-byte files? Can I use S3DistCp to flatten the tree?