I copy a tree of files from S3 to HDFS with S3DistCp in an initial EMR step (the copy command is sketched after the listing below). Running hdfs dfs -ls -R hdfs:///data_dir shows the expected files, which look something like:
/data_dir/year=2015/
/data_dir/year=2015/month=01/
/data_dir/year=2015/month=01/day=01/
/data_dir/year=2015/month=01/day=01/data01.12345678
/data_dir/year=2015/month=01/day=01/data02.12345678
/data_dir/year=2015/month=01/day=01/data03.12345678
The 'directories' are listed as zero-byte files.
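For reference, the copy step is roughly the following (the source bucket and path are placeholders, not my real ones):

s3-dist-cp --src s3://my-bucket/data_dir --dest hdfs:///data_dir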
I then run a Spark step which needs to read these files. The loading code is:
sqlctx.read.json('hdfs:///data_dir', schema=schema)
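For completeness, sqlctx and schema come from setup roughly like this (the schema fields shown are placeholders; the real schema matches the JSON records):

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType

sc = SparkContext(appName='load_json')
sqlctx = SQLContext(sc)

# placeholder fields -- the real schema describes the actual JSON records
schema = StructType([
    StructField('id', StringType(), True),
    StructField('payload', StringType(), True),
])

# this is the call that fails
df = sqlctx.read.json('hdfs:///data_dir', schema=schema)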
The job fails with a Java exception:
java.io.IOException: Not a file: hdfs://10.159.123.38:9000/data_dir/year=2015
I had (perhaps naively) assumed that Spark would recursively descend the directory tree and load the data files. If I point the same code at S3 instead, it loads the data successfully.
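That is, roughly this works fine (the bucket name is a placeholder):

df = sqlctx.read.json('s3://my-bucket/data_dir', schema=schema)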
Am I misunderstanding HDFS? Can I tell Spark to ignore the zero-byte files? Can I use S3DistCp to flatten the tree?