I'm reading tens of thousands of files into an RDD via something like sc.textFile("/data/*/*/*").
One problem is that most of these files are tiny, whereas others are huge. That leads to imbalanced tasks, which causes all sorts of well-known problems.
Can I break up the largest partitions by instead reading in my data via sc.textFile("/data/*/*/*", minPartitions=n_files*5), where n_files is the number of input files?
As covered elsewhere on Stack Overflow, minPartitions gets passed way down the Hadoop rabbit hole and is used in org.apache.hadoop.mapred.TextInputFormat.getSplits. My question is whether this is implemented such that the largest files are split first. In other words, is the splitting strategy one that tries to produce evenly sized partitions?
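To pin down what I'm asking: below is my rough paraphrase of that splitting logic, reconstructed in Python from skimming the code, and it may well be wrong, which is part of the question. The function names and parameters are my own; I'd like to know whether the real getSplits is essentially this, and where it actually lives.

    def compute_split_size(goal_size, min_size, block_size):
        # My understanding: splits aim for goal_size, are capped at the
        # block size, and are never smaller than min_size.
        return max(min_size, min(goal_size, block_size))

    def get_splits(files, num_splits, min_size, block_size):
        # files: list of (path, size_in_bytes) pairs.
        # goal_size is total input bytes divided by the requested number of
        # splits -- which, if I read it right, is where minPartitions
        # enters the picture.
        total_size = sum(size for _, size in files)
        goal_size = total_size // max(num_splits, 1)
        split_size = compute_split_size(goal_size, min_size, block_size)
        splits = []
        for path, size in files:
            # Each file seems to be chopped independently into chunks of
            # split_size; I can't tell whether anything favours splitting
            # the largest files first, hence the question.
            offset = 0
            while offset < size:
                length = min(split_size, size - offset)
                splits.append((path, offset, length))
                offset += length
        return splits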
I would prefer an answer that points to wherever the splitting strategy is actually implemented in a recent version of Spark/Hadoop.