@zero323 nailed it, but I thought I'd add a bit more (low-level) background on how this minPartitions
input parameter influences the number of partitions.
tl;dr The minPartitions parameter does have an effect on SparkContext.textFile, but only as the minimum (not the exact!) number of partitions.
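A quick way to see it for yourself (a minimal sketch; the path and the requested count are just placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("min-partitions-demo").setMaster("local[*]"))

// Ask for at least 10 partitions; for a sufficiently large, splittable
// file Hadoop typically returns at least that many splits, often more.
val lines = sc.textFile("hdfs:///some/large/file.txt", minPartitions = 10)
println(lines.partitions.length) // >= 10, not necessarily == 10
```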
In this particular case of SparkContext.textFile, the number of partitions is calculated directly by org.apache.hadoop.mapred.TextInputFormat.getSplits(jobConf, minPartitions), which textFile uses under the covers. It is TextInputFormat that knows how to partition (aka split) the distributed data; Spark merely follows its advice.
From the javadoc of Hadoop's FileInputFormat:
FileInputFormat is the base class for all file-based InputFormats. This provides a generic implementation of getSplits(JobConf, int). Subclasses of FileInputFormat can also override the isSplitable(FileSystem, Path) method to ensure input-files are not split-up and are processed as a whole by Mappers.
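For instance, here is a sketch (the class name is hypothetical) of a TextInputFormat subclass that refuses to split, so each input file ends up as exactly one partition no matter what minPartitions asks for:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.SparkContext

// One split (hence one Spark partition) per file, whatever minPartitions says.
class WholeFileTextInputFormat extends TextInputFormat {
  override protected def isSplitable(fs: FileSystem, file: Path): Boolean = false
}

// Used via the lower-level hadoopFile API:
def wholeFileLines(sc: SparkContext, path: String) =
  sc.hadoopFile(path, classOf[WholeFileTextInputFormat],
    classOf[LongWritable], classOf[Text], 10).map(_._2.toString)
```

You get the same effect for free with gzipped input, since TextInputFormat already reports compressed files as non-splittable, which is why a .gz file always lands in a single partition regardless of minPartitions.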
It is a very good example of how Spark leverages the Hadoop API.
BTW, you may find the sources enlightening ;-)