We have two DataFrames: df_A and df_B.
Say both have a huge number of rows and we need to partition them. How can we partition them as pairs, so that matching partitions of df_A and df_B are processed together?
For example, with 5 partitions:
- df_A partitions: partA_1, partA_2, partA_3, partA_4, partA_5
- df_B partitions: partB_1, partB_2, partB_3, partB_4, partB_5
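One way to get partitions that line up like this is to assign rows of both DataFrames to partitions with the same deterministic function of a shared key. The sketch below is plain Python, not a DataFrame API: the `id` key, the sample rows, and the helper names `partition_index` and `co_partition` are assumptions made for illustration. If these are Spark DataFrames, the rough analogue would be repartitioning both on the join key with the same partition count.

```python
import zlib

NUM_PARTITIONS = 5  # matches the 5-partition example above


def partition_index(key, num_partitions=NUM_PARTITIONS):
    """Map a key to a partition number in 1..num_partitions, deterministically."""
    # crc32 gives the same value in every process, unlike Python's
    # built-in hash() for strings, which is randomized per run.
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions + 1


def co_partition(rows, key_field, num_partitions=NUM_PARTITIONS):
    """Split rows into numbered partitions based on the value of key_field."""
    parts = {i: [] for i in range(1, num_partitions + 1)}
    for row in rows:
        parts[partition_index(row[key_field], num_partitions)].append(row)
    return parts


# Hypothetical rows standing in for df_A and df_B (assumed shared key: "id").
df_A = [{"id": i, "a": f"a{i}"} for i in range(20)]
df_B = [{"id": i, "b": f"b{i}"} for i in range(0, 20, 2)]

parts_A = co_partition(df_A, "id")
parts_B = co_partition(df_B, "id")

# Pair i holds partA_i and partB_i; rows with the same id land in the same pair.
pairs = [(parts_A[i], parts_B[i]) for i in range(1, NUM_PARTITIONS + 1)]
```

Because both sides use the same function and the same partition count, any row of df_B with a given `id` ends up in the partition whose number matches the df_A partition for that `id`, so each pair can be processed without looking at the other pairs.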
If we have 5 machines:
- machine_1: partA_1 and partB_1
- machine_2: partA_2 and partB_2
- machine_3: partA_3 and partB_3
- machine_4: partA_4 and partB_4
- machine_5: partA_5 and partB_5
If we have 3 machines:
- machine_1: partA_1 and partB_1
- machine_2: partA_2 and partB_2
- machine_3: partA_3 and partB_3
- ... (when machines free up) ...
- machine_1: partA_4 and partB_4
- machine_2: partA_5 and partB_5
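The assignment described above, where each pair goes to whichever machine frees up next, can be sketched as a small simulation. This is only an illustration: the function name `schedule_pairs` is made up, and it assumes every pair takes the same amount of time, which real workloads will not guarantee.

```python
import heapq


def schedule_pairs(num_pairs, num_machines, duration=1):
    """Assign pairs 1..num_pairs to machines; each machine takes the next
    pair as soon as it is free. Returns (start_time, machine, pair) tuples."""
    # Min-heap of (time the machine becomes free, machine number).
    free_at = [(0, m) for m in range(1, num_machines + 1)]
    heapq.heapify(free_at)
    assignments = []
    for pair in range(1, num_pairs + 1):
        start, machine = heapq.heappop(free_at)  # earliest free machine
        assignments.append((start, machine, pair))
        # Assumption: every pair takes the same `duration` to process.
        heapq.heappush(free_at, (start + duration, machine))
    return assignments


for start, machine, pair in schedule_pairs(num_pairs=5, num_machines=3):
    print(f"t={start}: machine_{machine} <- partA_{pair} and partB_{pair}")
```

With 5 pairs and 3 machines this reproduces the order listed above: pairs 1 to 3 start immediately, then machine_1 and machine_2 pick up pairs 4 and 5. The point is that the number of partitions does not have to equal the number of machines, as long as partA_i and partB_i are always handed out together.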
Note: If one of the DataFrames is small enough, we can use the broadcast technique.
What should we do (how should we partition) when both DataFrames, or more than two, are too large to broadcast?