
Let's say we have two very large data frames, A and B. I understand that if I use the same hash partitioner for both RDDs and then do the join, the keys will be co-located and the join may be faster with reduced shuffling (the only shuffle happens when the partitioner is first applied to A and B).
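To see why co-partitioning helps, here is a plain-Python toy model of the idea (the partition count and helper names are made up for illustration; in Spark this is what `partitionBy` with the same `HashPartitioner` on both RDDs achieves):

```python
# Toy co-partitioned join: both datasets are hashed into the same number
# of partitions, so matching keys always land in the same partition and
# each partition pair can be joined independently, with no
# cross-partition traffic (the analogue of Spark avoiding a shuffle
# when A and B already share a partitioner).

NUM_PARTITIONS = 4  # hypothetical partition count

def partition_by_key(pairs, n):
    """Group (key, value) pairs into n partitions by hash of key."""
    parts = [[] for _ in range(n)]
    for key, value in pairs:
        parts[hash(key) % n].append((key, value))
    return parts

def join_partition(part_a, part_b):
    """Plain hash join within a single partition."""
    lookup = {}
    for key, value in part_b:
        lookup.setdefault(key, []).append(value)
    return [(k, (va, vb)) for k, va in part_a for vb in lookup.get(k, [])]

def copartitioned_join(a, b, n=NUM_PARTITIONS):
    """Inner join of a and b, done partition-by-partition."""
    parts_a = partition_by_key(a, n)
    parts_b = partition_by_key(b, n)
    out = []
    for pa, pb in zip(parts_a, parts_b):
        out.extend(join_partition(pa, pb))
    return out
```

Because both sides use the same partitioning function, `zip(parts_a, parts_b)` pairs up exactly the partitions that can contain matching keys.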

I wanted to try something different though: a broadcast join. Say B is smaller than A, so we pick B to broadcast, but B is still a very big data frame. What we want to do is split B into multiple data frames and broadcast each piece in turn, joining it against A.
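For what it's worth, the chunked-broadcast idea is straightforward to sketch outside Spark; the names and chunk count below are made up for illustration, and in real Spark each per-chunk lookup map would be a broadcast variable shipped to the executors:

```python
def chunks(rows, n_chunks):
    """Split B's rows into n_chunks roughly equal slices."""
    size = max(1, -(-len(rows) // n_chunks))  # ceiling division
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

def chunked_broadcast_join(a, b, n_chunks=3):
    """Inner-join A with B by 'broadcasting' B one chunk at a time.

    Each chunk of B becomes a hash map (the stand-in for a broadcast
    variable) joined against all of A. Concatenating the per-chunk
    results is valid for an inner join because the chunks of B are
    disjoint: every matching (A-row, B-row) pair appears in exactly
    one chunk's output.
    """
    out = []
    for chunk in chunks(b, n_chunks):
        lookup = {}
        for key, value in chunk:
            lookup.setdefault(key, []).append(value)
        out.extend((k, (va, vb)) for k, va in a for vb in lookup.get(k, []))
    return out
```

Note the cost model this implies: A is scanned once per chunk, and in a cluster each chunk still has to reach every executor, which is the transfer-volume caveat raised in the answer below.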

Has anyone tried this? To split one data frame into many, the only method I see is randomSplit, but that doesn't look like a great option.

Any other better way to accomplish this task?

Thanks!

Kumar Vaibhav

1 Answer


Has anyone tried this?

Yes, someone has already tried that, in particular GoDataDriven; you can find the details in their write-up.

They claim pretty good results for skewed data; however, there are three problems you have to consider before doing this yourself:

  • There is no generic split operation in Spark. You have to filter the data multiple times, or eagerly cache complete partitions (How do I split an RDD into two or more RDDs?), to imitate "splitting".
  • The huge advantage of broadcast is the reduction in the amount of transferred data, because the broadcast side avoids a shuffle but is shipped to every executor. If the broadcast side is large, the amount of data transferred can actually increase significantly (Why my BroadcastHashJoin is slower than ShuffledHashJoin in Spark).
  • Each "join" increases the complexity of the execution plan, and with a long series of transformations things can get really slow on the driver side.

randomSplit method but that doesn't look so great an option.

It is actually not a bad one.

Any other better way to accomplish this task?

You may try filtering by partition id (for example with the spark_partition_id function in the DataFrame API, or mapPartitionsWithIndex on an RDD), which gives deterministic, non-overlapping pieces.
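To make that concrete, here is a plain-Python sketch of the same idea (helper names are made up; in Spark the partition id would come from the partitioner itself rather than a hand-rolled hash):

```python
def partition_id(key, n):
    """Deterministic partition id for a key; a stand-in for
    spark_partition_id() or a hash partitioner's assignment."""
    return hash(key) % n

def split_by_partition_id(rows, n):
    """Imitate 'split' by filtering the same dataset n times, once per
    partition id. The resulting pieces are pairwise disjoint and their
    union is exactly the input, which is what makes joining the pieces
    independently and concatenating the results correct."""
    return [
        [(k, v) for k, v in rows if partition_id(k, n) == pid]
        for pid in range(n)
    ]
```

Unlike randomSplit, the assignment here is a pure function of the key, so the same key always lands in the same piece, and each piece can be broadcast and joined on its own.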