I have two big Parquet `DataFrame`s and I want to join them on a `userId`.

What should I do to get high performance?
Should I modify the code that writes those files so that it:

- calls `partitionBy` on the `userId` (which is very sparse), or
- calls `partitionBy` on the first N characters of the `userId`? (AFAIK, if the data are already partitioned on the same key, the join will occur with no shuffle.)
On the read side, is it better to use an `RDD` or a `DataFrame`?