In Spark, I understand how to use `wholeTextFiles` and `textFile`, but I'm not sure which to use when. Here is what I know so far:
- When dealing with files that are not split by line, one should use `wholeTextFiles`; otherwise, use `textFile`.
I would think that, by default, `wholeTextFiles` partitions by file content and `textFile` partitions by lines. However, both allow you to change the `minPartitions` parameter.
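For reference, here is a minimal sketch of the two calls side by side, assuming a local SparkContext and a hypothetical path under /tmp/data (the partition count of 8 is arbitrary; both methods take `minPartitions` as an optional second argument):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical local setup and paths, purely for illustration.
val sc = new SparkContext(new SparkConf().setAppName("partition-demo").setMaster("local[*]"))

// textFile: one record per line; minPartitions is a hint for the number of input splits.
val lines = sc.textFile("/tmp/data/large.txt", minPartitions = 8)

// wholeTextFiles: one (path, content) pair per file; a single file is never split across records.
val files = sc.wholeTextFiles("/tmp/data", minPartitions = 8)

println(s"textFile partitions: ${lines.getNumPartitions}")
println(s"wholeTextFiles partitions: ${files.getNumPartitions}")
```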
So, how does changing `minPartitions` affect how these are processed?
For example, say I have one very large file with 100 lines. What would be the difference between processing it with `wholeTextFiles` and 100 partitions, and processing it with `textFile` (which, as I understand it, partitions it line by line) with `minPartitions` set to 100?
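To make the comparison concrete, this is roughly how I would inspect both readings of that hypothetical 100-line file (the path and the local master are assumptions on my part, and the printed partition counts may not match what I expect):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("compare-reads").setMaster("local[*]"))

// Reading line by line: count() should return 100, one record per line.
val asLines = sc.textFile("/tmp/data/large.txt", minPartitions = 100)
println(s"textFile -> records: ${asLines.count()}, partitions: ${asLines.getNumPartitions}")

// Reading whole files: count() should return 1, since the entire file becomes a single (path, content) record.
val asWhole = sc.wholeTextFiles("/tmp/data/large.txt", minPartitions = 100)
println(s"wholeTextFiles -> records: ${asWhole.count()}, partitions: ${asWhole.getNumPartitions}")
```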