
We have two DataFrames: df_A and df_B.

Let's say both have a huge number of rows and we need to partition them. How can we partition them as matching pairs?

For example, with 5 partitions:

  • df_A partitions: partA_1, partA_2, partA_3, partA_4, partA_5
  • df_B partitions: partB_1, partB_2, partB_3, partB_4, partB_5

If we have 5 machines:

  • machine_1: partA_1 and partB_1
  • machine_2: partA_2 and partB_2
  • machine_3: partA_3 and partB_3
  • machine_4: partA_4 and partB_4
  • machine_5: partA_5 and partB_5

If we have 3 machines:

  • machine_1: partA_1 and partB_1
  • machine_2: partA_2 and partB_2
  • machine_3: partA_3 and partB_3
  • ...(as machines free up)...
  • machine_1: partA_4 and partB_4
  • machine_2: partA_5 and partB_5

Note: if one of the DataFrames is small enough, we can use the broadcast technique, for example as in the sketch below.
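A minimal broadcast-join sketch in Scala; the join key "id" is only an illustrative assumption. The small DataFrame is shipped to every Executor, so the large one is never shuffled:

import org.apache.spark.sql.functions.broadcast

val joined = df_A.join(broadcast(df_B), Seq("id"))  // df_B is copied to each Executor; df_A keeps its existing partitioning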

What should we do (how should we partition) when both (or more than two) DataFrames are large?

David

1 Answer


I think we need to take a step back here. I will look only at the case where both DataFrames are large, not at broadcast.

Spark is a framework that manages things for your App in terms of co-location of DataFrame partitions, taking into account resources allocated vs. resources available and the type of Action, and thus whether Workers need to acquire partitions for processing.

A repartition is a Transformation. Nothing runs until an Action, such as a write:

peopleDF.select("name", "age").write.format("parquet").save("namesAndAges.parquet")

is invoked; at that point things kick in, as in the sketch below.
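A minimal sketch in Scala of this lazy behaviour, assuming a SparkSession named spark; the paths, the column name "key" and the partition count are illustrative assumptions only:

import org.apache.spark.sql.functions.col

val dfA = spark.read.parquet("/data/df_A")           // Transformation: nothing is computed yet
val dfB = spark.read.parquet("/data/df_B")
val repA = dfA.repartition(5, col("key"))            // Transformation: only recorded in the plan
val repB = dfB.repartition(5, col("key"))
repA.join(repB, Seq("key"))                          // still lazy
    .write.format("parquet").save("/data/out.parquet")  // Action: the shuffle, join and write actually run here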

  • If you have a JOIN, then Spark will work out whether re-partitioning and movement of data is required.
  • That is to say, if you join on c1 in both DF's, then re-partitioning will most likely occur on the c1 column, so that rows with the same c1 value from both DF's are shuffled to the same Nodes, where a free Executor is waiting to serve that JOIN of 2 or more partitions (see the sketch after this list).
  • That only occurs when an Action is invoked. In this way, if you apply unnecessary Transformations, Catalyst can optimize them away.
  • Also, for the number of partitions used, this is a good link imho: spark.sql.shuffle.partitions of 200 default partitions conundrum
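A sketch in Scala of the JOIN case described above; the paths, the column c1 and the partition count are illustrative assumptions. Both sides are shuffled by c1 into the number of partitions given by spark.sql.shuffle.partitions when the Action fires:

spark.conf.set("spark.sql.shuffle.partitions", "400")        // override the 200-partition default for large data

val dfA = spark.read.parquet("/data/df_A")
val dfB = spark.read.parquet("/data/df_B")
val joined = dfA.join(dfB, Seq("c1"))                         // rows with the same c1 value from both DF's land in the same shuffle partition
joined.write.format("parquet").save("/data/joined.parquet")   // the Action that triggers the shuffle and the JOIN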
thebluephantom