
In Spark, I understand how to use wholeTextFiles and textFile, but I'm not sure which to use when. Here is what I know so far:

  • When a file needs to be processed as a whole (its contents are not line-oriented records), one should use wholeTextFiles; otherwise use textFile.

I would think that by default, wholeTextFiles partitions by file and textFile partitions by line. But both of them allow you to change the parameter minPartitions.

So, how does changing minPartitions affect how these files are processed?

For example, say I have one very large file with 100 lines. What would be the difference between processing it with wholeTextFiles and minPartitions=100, and processing it with textFile (which splits it line by line) and minPartitions=100?

What is the difference between these?
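Concretely, these are the two calls I am comparing (a sketch; the file path and an existing SparkContext named `sc` are assumptions):

    // Variant 1: one record per (file name, file contents) pair, requesting 100 partitions
    val asWholeFiles = sc.wholeTextFiles("hdfs:///data/big.txt", 100)

    // Variant 2: one record per line, requesting at least 100 partitions
    val asLines = sc.textFile("hdfs:///data/big.txt", 100)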

makansij
  • Related: http://stackoverflow.com/questions/27989617/spark-sc-wholetextfiles-takes-a-long-time-to-execute – makansij Nov 25 '15 at 01:25

1 Answer


For reference, wholeTextFiles uses WholeTextFileInputFormat which extends CombineFileInputFormat.

A couple of notes on wholeTextFiles:

  • Each record in the RDD returned by wholeTextFiles has the file name and the entire contents of the file. This means that a file cannot be split (at all).
  • Because it extends CombineFileInputFormat, it will try to combine groups of smaller files into one partition.

If I have two small files in a directory, it is possible that both files will end up in a single partition. If I set minPartitions=2, then I will likely get two partitions back instead.

Now if I were to set minPartitions=3, I will still get back two partitions, because the contract of wholeTextFiles is that each record in the RDD contains an entire file.
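A minimal sketch of that behavior (the directory layout is an assumption, and whether small files actually get combined depends on their sizes):

    // Suppose /data/small contains exactly two small files, a.txt and b.txt.
    val combined = sc.wholeTextFiles("/data/small")
    combined.partitions.length   // likely 1: both files combined into one partition

    val split = sc.wholeTextFiles("/data/small", 2)
    split.partitions.length      // likely 2: one whole file per partition

    val capped = sc.wholeTextFiles("/data/small", 3)
    capped.partitions.length     // still 2: a record always holds an entire file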

Mike Park
  • Thanks for your answer. So, let me make sure I understand correctly: the `InputFormat`, splittability, and `minPartitions` only affect how the file is converted into an RDD. Correct? Then, after reading the input, it is recommended that a particularly large unsplittable file be repartitioned using `repartition`, which will split the file based on Hadoop blocks. True? – makansij Dec 12 '15 at 23:17
  • Almost correct. Spark's `repartition` uses an internal shuffling framework that does not take into account HDFS blocks or any specifics of the underlying storage. – Mike Park Dec 14 '15 at 17:03
  • Okay, so then whatever value I give to `repartition` should be the number of partitions I get back? (See the sketch after these comments.) – makansij Dec 16 '15 at 04:33
  • `JavaPairRDD<String, String> wholeTextFiles = jssc.sparkContext().wholeTextFiles(args[0]); wholeTextFiles.repartition(3);` wholeTextFiles is not being re-partitioned. Though I specify 3 partitions, it's taking only two partitions. – AKC Apr 10 '17 at 20:04
  • Your answer saved me. Thanks a lot. – Daebarkee May 11 '18 at 10:53
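Regarding the `repartition` exchange above: `repartition` returns a new RDD rather than modifying the RDD it is called on, which would explain the behavior AKC describes. A sketch (the input path is an assumption):

    val files = sc.wholeTextFiles("/data/input")
    files.repartition(3)                      // returned RDD is discarded; `files` is unchanged
    val repartitioned = files.repartition(3)  // keep the result instead
    repartitioned.partitions.length           // 3: repartition shuffles into exactly N partitions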