I have two DataFrames with a large number of rows (millions to tens of millions). I'd like to do a join between them.

In the BI system I'm currently using, you make this fast by first partitioning on a particular key, then doing the join on that key.

Is this a pattern I need to follow in Spark, or does it not matter? At first glance it seems like a lot of time is wasted shuffling data between partitions because the data hasn't been pre-partitioned correctly.

If it is necessary, then how do I do that?

Daniel

1 Answer

If it is necessary, then how do I do that?

How to define partitioning of DataFrame?
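
As a minimal sketch in the Scala API: repartition both sides on the join key before joining. The paths and the column name `id` below are placeholders, not from the question:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("partitioned-join").getOrCreate()

// Hypothetical inputs; substitute your own sources.
val left  = spark.read.parquet("/path/to/left")
val right = spark.read.parquet("/path/to/right")

// Hash-partition both sides on the join key up front.
val leftPart  = left.repartition(left("id"))
val rightPart = right.repartition(right("id"))

val joined = leftPart.join(rightPart, "id")
```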

However, it makes sense only under two conditions:

  • There are multiple joins within the same application. Repartitioning is itself a shuffle, so if there is only a single join there is no added value (see the sketch after this list).
  • It is a long-lived application where the shuffled data will be reused. Spark cannot take advantage of the partitioning of data stored in an external format.
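
To illustrate the first condition, a sketch assuming `users`, `orders`, and `returns` are existing DataFrames sharing a `user_id` column (all names are hypothetical):

```scala
// Repartition once on the join key and cache the shuffled result.
val usersPart = users.repartition(users("user_id")).cache()

// Subsequent joins reuse the cached, already-partitioned data
// instead of shuffling `users` again for each join.
val withOrders  = usersPart.join(orders, "user_id")
val withReturns = usersPart.join(returns, "user_id")
```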