
I'm reading tens of thousands of files into an RDD via something like sc.textFile("/data/*/*/*"). One problem is that most of these files are tiny, whereas others are huge. That leads to imbalanced tasks, which causes all sorts of well-known problems.

Can I break up the largest partitions by instead reading in my data via sc.textFile("/data/*/*/*", minPartitions=n_files*5), where n_files is the number of input files?
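For concreteness, a minimal sketch of what I have in mind (the file count and the multiplier of 5 are placeholders, not values I'm committed to):

    # n_files would come from listing the input directories; 50_000 is illustrative.
    n_files = 50_000
    rdd = sc.textFile("/data/*/*/*", minPartitions=n_files * 5)
    print(rdd.getNumPartitions())  # the actual partition count may exceed minPartitions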

As covered elsewhere on Stack Overflow, minPartitions gets passed way down the Hadoop rabbit hole and is used in org.apache.hadoop.mapred.TextInputFormat.getSplits. My question is whether it is implemented such that the largest files are split first. In other words, does the splitting strategy try to produce evenly sized partitions?

I would prefer an answer that points to wherever the splitting strategy is actually implemented in a recent version of Spark/Hadoop.


1 Answer


Nobody's posted an answer, so I dug into this myself and will answer my own question:

It appears that, if your input file(s) are splittable, then textFile will indeed try to balance partition size if you use the minPartitions option.

The partitioning strategy is implemented here, i.e., in the getSplits method of org.apache.hadoop.mapred.TextInputFormat. The strategy is complex; it operates by first setting goalSize, which is simply the total byte size of the input divided by numSplits (minPartitions is passed down to set the value of numSplits). It then splits up the files in a way that tries to make each partition's size (in terms of its input's byte size) as close as possible to the goalSize.
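To make that concrete, here is a rough Python paraphrase of that logic. It is a simplified sketch, not the actual Hadoop code: the SPLIT_SLOP factor of 1.1 and the blockSize cap follow the FileInputFormat source as I read it, and per-file details (e.g. each file's own block size) are glossed over.

    def get_splits(file_sizes, num_splits, block_size, min_size=1, split_slop=1.1):
        """Simplified paraphrase of FileInputFormat.getSplits for splittable files."""
        total_size = sum(file_sizes)
        goal_size = total_size // max(num_splits, 1)       # target bytes per split
        split_size = max(min_size, min(goal_size, block_size))

        splits = []  # list of split lengths, in bytes
        for size in file_sizes:
            remaining = size
            # Carve off split_size chunks while the leftover is big enough;
            # a split never spans more than one file.
            while remaining / split_size > split_slop:
                splits.append(split_size)
                remaining -= split_size
            if remaining > 0:
                splits.append(remaining)
        return splits

The key point is that goalSize is global: a huge file gets carved into many chunks of roughly goalSize bytes, while a tiny file stays as a single split, which is what evens out partition sizes.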

If your input file(s) are not splittable, then this splitting will not take place: see the source code here.
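This matters in practice for compressed inputs: gzip, for example, is not splittable, so each .gz file ends up as a single partition no matter what minPartitions you pass. A quick way to see it (the path and value are illustrative):

    rdd = sc.textFile("/data/*/*/*.gz", minPartitions=100000)
    rdd.getNumPartitions()   # still one partition per .gz file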
