
I have two dataframes like this:

  student_rdf = (studentid, name, ...)
  student_result_rdf = (studentid, gpa, ...)

We need to join these two dataframes. Currently we do it like this:

  student_rdf.join(student_result_rdf, student_result_rdf["studentid"] == student_rdf["studentid"])

So it is simple, but it creates a lot of data shuffling across the worker nodes. Since the join key is the same in both dataframes, if each dataframe were partitioned by that key (studentid), the matching rows would already reside on the same node and there should be no shuffling at all. Is that possible?

I am looking for a way to partition the data by a column while reading a dataframe from the input. And if it is possible for Spark to understand that the partition keys of the two dataframes match, how would that work?

Zer001
  • Did you get the answer for this? – vikrant rana May 14 '19 at 07:41
  • Possible duplicate of [Partition data for efficient joining for Spark dataframe/dataset](https://stackoverflow.com/questions/48160627/partition-data-for-efficient-joining-for-spark-dataframe-dataset) – bsplosion May 22 '19 at 14:39

0 Answers