
We have two DataFrames: df_A and df_B.

Let's say both have a huge number of rows and we need to partition them. How can we partition them as matching pairs?

For example, with 5 partitions:

  • df_A partitions: partA_1, partA_2, partA_3, partA_4, partA_5
  • df_B partitions: partB_1, partB_2, partB_3, partB_4, partB_5

If we have 5 machines:

  • machine_1: partA_1 and partB_1
  • machine_2: partA_2 and partB_2
  • machine_3: partA_3 and partB_3
  • machine_4: partA_4 and partB_4
  • machine_5: partA_5 and partB_5

If we have 3 machines:

  • machine_1: partA_1 and partB_1
  • machine_2: partA_2 and partB_2
  • machine_3: partA_3 and partB_3
  • ...(as machines free up)...
  • machine_1: partA_4 and partB_4
  • machine_2: partA_5 and partB_5

Note: if one of the DataFrames is small enough, we can use the broadcast technique, for example as in the sketch below.
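A minimal broadcast-join sketch in Scala; the join key "id" is only an illustrative assumption. The small DataFrame is shipped to every Executor, so the large one is never shuffled:

import org.apache.spark.sql.functions.broadcast

val joined = df_A.join(broadcast(df_B), Seq("id"))  // df_B is copied to each Executor; df_A keeps its existing partitioning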

What should we do (how should we partition) when both (or more than two) DataFrames are large?

David

1 Answer


I think we need to take a step back here. I will look only at the case where both DataFrames are large, not at broadcast.

Spark is a framework that manages things for your App in terms of co-location of DataFrame partitions, taking into account resources allocated vs. resources available and the type of Action, and thus whether Workers need to acquire partitions for processing.

A repartition is a Transformation. Nothing runs until an Action, such as a write:

peopleDF.select("name", "age").write.format("parquet").save("namesAndAges.parquet")

is invoked; at that point things kick in, as in the sketch below.
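A minimal sketch in Scala of this lazy behaviour, assuming a SparkSession named spark; the paths, the column name "key" and the partition count are illustrative assumptions only:

import org.apache.spark.sql.functions.col

val dfA = spark.read.parquet("/data/df_A")           // Transformation: nothing is computed yet
val dfB = spark.read.parquet("/data/df_B")
val repA = dfA.repartition(5, col("key"))            // Transformation: only recorded in the plan
val repB = dfB.repartition(5, col("key"))
repA.join(repB, Seq("key"))                          // still lazy
    .write.format("parquet").save("/data/out.parquet")  // Action: the shuffle, join and write actually run here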

  • If you have a JOIN, then Spark will work out whether re-partitioning and movement of data is required.
  • That is to say, if you join on c1 in both DF's, then re-partitioning will most likely occur on the c1 column, so that rows with the same c1 value from both DF's are shuffled to the same Nodes, where a free Executor is waiting to serve that JOIN of 2 or more partitions (see the sketch after this list).
  • That only occurs when an Action is invoked. In this way, if you apply unnecessary Transformations, Catalyst can optimize them away.
  • Also, for the number of partitions used, this is a good link imho: spark.sql.shuffle.partitions of 200 default partitions conundrum
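A sketch in Scala of the JOIN case described above; the paths, the column c1 and the partition count are illustrative assumptions. Both sides are shuffled by c1 into the number of partitions given by spark.sql.shuffle.partitions when the Action fires:

spark.conf.set("spark.sql.shuffle.partitions", "400")        // override the 200-partition default for large data

val dfA = spark.read.parquet("/data/df_A")
val dfB = spark.read.parquet("/data/df_B")
val joined = dfA.join(dfB, Seq("c1"))                         // rows with the same c1 value from both DF's land in the same shuffle partition
joined.write.format("parquet").save("/data/joined.parquet")   // the Action that triggers the shuffle and the JOIN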
thebluephantom