I am trying to optimise a join in my Spark application.
I have gone through the points in this question: How to avoid shuffles while joining DataFrames on unique keys?
I have made sure that matching join keys from both datasets land in the same partition (using my custom partitioner).
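For reference, this is roughly my current setup, sketched in the RDD API (MyPartitioner and the sample data are placeholders for my actual code, and sc is the spark-shell SparkContext):

```scala
import org.apache.spark.Partitioner

// Placeholder for my actual custom partitioner
class MyPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = {
    // non-negative modulo so negative hashCodes still map to a valid partition
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }
}

val partitioner = new MyPartitioner(4)

// Both sides are partitioned with the same partitioner instance, so
// matching keys land in the same partition index on both sides.
val left  = sc.parallelize(Seq(("a", 1L), ("b", 2L))).partitionBy(partitioner).persist()
val right = sc.parallelize(Seq(("a", 10L), ("c", 30L))).partitionBy(partitioner).persist()
```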
I also cannot use a broadcast join because my data may become large depending on the situation.
In the answer to the above-mentioned question, repartitioning only optimises the join, but what I need is a join WITHOUT A SHUFFLE. Joining records by key within each partition is perfectly fine for my use case.
Is this possible? If no similar functionality exists, I want to implement something like a joinPerPartition myself, as sketched below.
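Here is a rough sketch of the joinPerPartition I have in mind, built on zipPartitions (an inner join; the method name is mine, and it assumes both RDDs already share the same partitioner and number of partitions):

```scala
import org.apache.spark.rdd.RDD

// Inner join done entirely within each partition; zipPartitions is a
// narrow transformation, so no shuffle is introduced here.
def joinPerPartition[K, V, W](left: RDD[(K, V)], right: RDD[(K, W)]): RDD[(K, (V, W))] =
  left.zipPartitions(right, preservesPartitioning = true) { (lIter, rIter) =>
    // Materialise the right-hand partition into a key -> rows lookup.
    val rightByKey = rIter.toSeq.groupBy(_._1)
    lIter.flatMap { case (k, v) =>
      rightByKey.getOrElse(k, Seq.empty).map { case (_, w) => (k, (v, w)) }
    }
  }

val joined = joinPerPartition(left, right)  // only valid if both sides are co-partitioned
```

One obvious downside is that this buffers the entire right-hand partition in memory, so I would prefer a built-in way of doing this if one exists.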