My S3 bucket has two levels of nested directories: ~6,000 directories at level 1, each containing 10-500 directories at level 2.
The problem is that when reading it with Spark, e.g. new SQLContext(sc).read.parquet(path),
I get slowdowns from S3 because of the massive number of calls made while listing the files.
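For reference, this is roughly the read pattern I'm describing; a minimal sketch, where the bucket name and path are placeholders for my real layout:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("s3-parquet-listing-repro")
val sc = new SparkContext(conf)

// "s3a://my-bucket/table/" is a placeholder; the real path has
// ~6,000 level-1 directories with 10-500 level-2 directories each.
// Partition discovery walks every one of these directories, which
// is where the large number of S3 LIST calls comes from.
val df = new SQLContext(sc).read.parquet("s3a://my-bucket/table/")
df.printSchema()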
I saw this post, which deals with a patch for the issue: Spark lists all leaf node even in partitioned data, and this JIRA issue: https://issues.apache.org/jira/browse/HADOOP-13208
I was wondering if anyone has tried it successfully, because I'm using Hadoop 2.9 and I'm still seeing this issue.
Steve Loughran, if you could respond to this I would be very thankful.