I am reading in an input file using PySpark and I'm wondering what the best way is to repartition the input data so it gets spread evenly across the Mesos cluster.
Currently, I'm doing:
rdd = sc.textFile('filename').repartition(10)
I was looking at the SparkContext documentation and noticed that the textFile method has an option called minPartitions, which defaults to None.
I'm wondering if it will be more efficient if I specify my partition value there. For example:
rdd = sc.textFile('filename', 10)
I'm assuming/hoping that if I read the file into the desired number of partitions to begin with, it will eliminate the need for a shuffle after the data has been read in.
Do I understand it correctly? If not, what is the difference between the two methods (if any)?
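For reference, here is a minimal sketch of how I'd compare the two approaches side by side (assuming a plain text file and using getNumPartitions() to inspect the result):

    from pyspark import SparkContext

    sc = SparkContext(appName="partition-comparison")

    # Option 1: read with default partitioning, then shuffle into 10 partitions
    rdd_repartitioned = sc.textFile('filename').repartition(10)

    # Option 2: ask textFile for at least 10 partitions up front via minPartitions
    rdd_min_partitions = sc.textFile('filename', minPartitions=10)

    # Check how many partitions each RDD actually ended up with
    print(rdd_repartitioned.getNumPartitions())
    print(rdd_min_partitions.getNumPartitions())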