I am reading in an input file using PySpark and I'm wondering what the best way is to repartition the input data so it gets spread evenly across the Mesos cluster.
Currently, I'm doing:
rdd = sc.textFile('filename').repartition(10)
I was looking at the SparkContext documentation and noticed that the textFile method has an option called minPartitions, which defaults to None.
I'm wondering if it will be more efficient if I specify my partition value there. For example:
rdd = sc.textFile('filename', 10)
I'm assuming/hoping that if I read the file into the desired number of partitions to begin with, it will eliminate the need for a shuffle after the data has been read in.
Do I understand it correctly? If not, what is the difference between the two methods (if any)?
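For reference, here is a minimal sketch of how I'd compare the two approaches side by side (assuming a plain text file and using getNumPartitions() to inspect the result):

    from pyspark import SparkContext

    sc = SparkContext(appName="partition-comparison")

    # Option 1: read with default partitioning, then shuffle into 10 partitions
    rdd_repartitioned = sc.textFile('filename').repartition(10)

    # Option 2: ask textFile for at least 10 partitions up front via minPartitions
    rdd_min_partitions = sc.textFile('filename', minPartitions=10)

    # Check how many partitions each RDD actually ended up with
    print(rdd_repartitioned.getNumPartitions())
    print(rdd_min_partitions.getNumPartitions())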