My S3 bucket has two levels of nested directories (level 1: ~6000 directories; level 2: 10-500 directories each). The problem is that when reading it with Spark, e.g. `new SQLContext(sc).read.parquet(path)`, I get slowdowns from S3 because of the massive number of calls made while listing the files.
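
To give a sense of scale, here is a back-of-the-envelope count of the S3 LIST requests a naive recursive treewalk would issue over this tree (the average of ~250 level-2 directories is my own assumption, taken as a rough midpoint of the 10-500 range above):

```python
# Hypothetical sketch: estimate how many S3 LIST calls a naive recursive
# listing of a two-level directory tree issues. The directory counts are
# assumptions based on the numbers stated in the question.
level1_dirs = 6000
avg_level2_dirs = 250  # assumed average of the stated 10-500 range

# A naive treewalk issues at least one LIST per directory visited:
# one for the root, one per level-1 dir, and one per level-2 dir.
list_calls = 1 + level1_dirs + level1_dirs * avg_level2_dirs
print(list_calls)  # on the order of 1.5 million requests before any data is read
```

With per-request S3 latency on top of that, the slowdown during file listing is unsurprising.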

I saw this post dealing with a patch for that issue: Spark lists all leaf nodes even in partitioned data, and this JIRA: https://issues.apache.org/jira/browse/HADOOP-13208

I was wondering whether anyone has tried it successfully, because I'm using Hadoop 2.9 and I'm still seeing this issue.
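
For reference, these are the settings I have been experimenting with to cut down the listing work a Parquet read triggers (a sketch only; whether they help will depend on the Spark/Hadoop versions in play, and they do not change the underlying listing behavior the JIRA addresses):

```
# spark-defaults.conf fragment (a sketch; tune values to your cluster)
# Skip schema merging so Spark does not have to touch every footer:
spark.sql.parquet.mergeSchema                              false
# Hand partition discovery off to the cluster once a path count is exceeded:
spark.sql.sources.parallelPartitionDiscovery.threshold     32
# Allow more concurrent S3A connections for listing and footer reads:
spark.hadoop.fs.s3a.connection.maximum                     200
```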

Steve Loughran, if you can respond to it I would be very thankful.

psyduck
  • > Steve Loughran, if you can respond to it I would be very thankful. Other than "it's a bit presumptive to ask like that" and "I've discussed directory layouts elsewhere", my answer is "upgrade to hadoop-3.2". – stevel Apr 14 '20 at 12:19
  • @SteveLoughran Thank you for the response! As far as I know, there isn't a Spark version that supports Hadoop 3.2 – psyduck Apr 19 '20 at 18:07
  • 1
    there is if you build it yourself; use the hadoop-3.2 profile, that is maven with the settings: mvn install -DskipTests -Phive -Pyarn -Phadoop-3.2 -Phadoop-cloud ... don't be afraid of building your own – stevel Apr 22 '20 at 16:51

0 Answers