I have a DataFrame with a million records. It looks like this:
df.show()
+--------------------+--------------------+-------+
|            feature1|            feature2| domain|
+--------------------+--------------------+-------+
|[2.23668528E8, 1....|[2.23668528E8, 1....|domain1|
|[2.23668528E8, 1....|[2.23668528E8, 1....|domain2|
|[2.23668528E8, 1....|[2.23668528E8, 1....|domain1|
|[2.23668528E8, 1....|[2.23668528E8, 1....|domain2|
|[2.23668528E8, 1....|[2.23668528E8, 1....|domain1|
The ideal partition size in Spark is 128 MB, and let's suppose the domain column has two unique values (domain1 and domain2). Considering this, I have two questions:
If I do
df.repartition("domain")
and one partition is not able to accommodate all the data for a particular domain key, will the application fail, or will it automatically create as many partitions as the data requires?

Suppose repartitioning on the domain key has already happened for the data above, so there are two populated partitions (the unique keys being domain1 and domain2). Now let's say domain1 and domain2 each appear 1,000,000 times and I do a self-join on domain, so for each domain I will get approximately 10^12 records. Given that we have two partitions and the number of partitions doesn't change during the join, will those two partitions be able to handle that many (~10^12) records?
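To make the scenario concrete, here is a minimal PySpark sketch of what I mean. It uses a small synthetic DataFrame (only the domain key matters here, the feature columns are just placeholders, not my real data) and spark_partition_id to count rows per partition after the repartition, then plans the self-join:

from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id, count

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Synthetic stand-in for the real DataFrame: many rows, only two domain keys.
df = spark.range(1000000).selectExpr(
    "id AS feature1",
    "id AS feature2",
    "CASE WHEN id % 2 = 0 THEN 'domain1' ELSE 'domain2' END AS domain",
)

# Hash-partition on the key column; with only two distinct keys, at most two
# of the spark.sql.shuffle.partitions (default 200) output partitions will
# actually contain rows.
repartitioned = df.repartition("domain")

# Count rows per physical partition to see how the two keys are laid out.
repartitioned.groupBy(spark_partition_id().alias("pid")) \
    .agg(count("*").alias("rows")) \
    .show()

# Self-join on the skewed key: every pair of rows within a domain matches,
# so the output per domain grows roughly quadratically with the row count.
joined = repartitioned.alias("a").join(repartitioned.alias("b"), "domain")
joined.explain()  # inspect the planned shuffle/partitioning without running the full join

This is only a toy version of the real pipeline (in the actual job the feature columns are arrays), but the partitioning behavior on the domain key should be the same.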