I have two DataFrames with a large number of rows (millions to tens of millions). I'd like to do a join between them.

In the BI system I'm currently using, you make this fast by first partitioning on a particular key, then doing the join on that key.

Is this a pattern I need to follow in Spark, or does it not matter? At first glance it seems like a lot of time is wasted shuffling data between partitions because the data hasn't been pre-partitioned correctly.

If it is necessary, then how do I do that?

Daniel

1 Answer

If it is necessary, then how do I do that?

How to define partitioning of DataFrame?
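
As a minimal sketch in the Scala API: repartition both sides on the join key before joining. The paths and the column name `id` below are placeholders, not from the question:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("partitioned-join").getOrCreate()

// Hypothetical inputs; substitute your own sources.
val left  = spark.read.parquet("/path/to/left")
val right = spark.read.parquet("/path/to/right")

// Hash-partition both sides on the join key up front.
val leftPart  = left.repartition(left("id"))
val rightPart = right.repartition(right("id"))

val joined = leftPart.join(rightPart, "id")
```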

However, it makes sense only under two conditions:

  • There are multiple joins within the same application. Repartitioning is itself a shuffle, so if there is only a single join there is no added value (see the sketch after this list).
  • It is a long-lived application where the shuffled data will be reused. Spark cannot take advantage of the partitioning of data stored in an external format.
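
To illustrate the first condition, a sketch assuming `users`, `orders`, and `returns` are existing DataFrames sharing a `user_id` column (all names are hypothetical):

```scala
// Repartition once on the join key and cache the shuffled result.
val usersPart = users.repartition(users("user_id")).cache()

// Subsequent joins reuse the cached, already-partitioned data
// instead of shuffling `users` again for each join.
val withOrders  = usersPart.join(orders, "user_id")
val withReturns = usersPart.join(returns, "user_id")
```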