I have a DataFrame with a million records. It looks like this:
df.show()
+--------------------+--------------------+-------+
|            feature1|            feature2| domain|
+--------------------+--------------------+-------+
|[2.23668528E8, 1....|[2.23668528E8, 1....|domain1|
|[2.23668528E8, 1....|[2.23668528E8, 1....|domain2|
|[2.23668528E8, 1....|[2.23668528E8, 1....|domain1|
|[2.23668528E8, 1....|[2.23668528E8, 1....|domain2|
|[2.23668528E8, 1....|[2.23668528E8, 1....|domain1|
The ideal partition size in Spark is 128 MB, and let's suppose the domain column has two unique values (domain1 and domain2). Considering this, I have two questions:
If I do
df.repartition("domain")
and one partition is not able to accommodate all the data for a particular domain key, will the application fail, or will it automatically create as many partitions as the data requires?

Suppose repartitioning on the domain key has already happened for the data above, so there are two populated partitions (the unique keys being domain1 and domain2). Now let's say domain1 and domain2 each appear 1,000,000 times and I do a self-join on domain, so for each domain I will get approximately 10^12 records. Given that we have two partitions and the number of partitions doesn't change during the join, will those two partitions be able to handle that many (~10^12) records?
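To make the scenario concrete, here is a minimal PySpark sketch of what I mean. It uses a small synthetic DataFrame (only the domain key matters here, the feature columns are just placeholders, not my real data) and spark_partition_id to count rows per partition after the repartition, then plans the self-join:

from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id, count

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Synthetic stand-in for the real DataFrame: many rows, only two domain keys.
df = spark.range(1000000).selectExpr(
    "id AS feature1",
    "id AS feature2",
    "CASE WHEN id % 2 = 0 THEN 'domain1' ELSE 'domain2' END AS domain",
)

# Hash-partition on the key column; with only two distinct keys, at most two
# of the spark.sql.shuffle.partitions (default 200) output partitions will
# actually contain rows.
repartitioned = df.repartition("domain")

# Count rows per physical partition to see how the two keys are laid out.
repartitioned.groupBy(spark_partition_id().alias("pid")) \
    .agg(count("*").alias("rows")) \
    .show()

# Self-join on the skewed key: every pair of rows within a domain matches,
# so the output per domain grows roughly quadratically with the row count.
joined = repartitioned.alias("a").join(repartitioned.alias("b"), "domain")
joined.explain()  # inspect the planned shuffle/partitioning without running the full join

This is only a toy version of the real pipeline (in the actual job the feature columns are arrays), but the partitioning behavior on the domain key should be the same.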