When I do a
df.groupByKey("<column>").mapGroups((key,value) => myfunction(value))
vs
df.repartition("<column>").mapPartitions(...)
I would like to know which is more efficient when applied to a large DataFrame.
What I know is that both result in a shuffle, but repartition will make sure that all rows with the same value of the partitioning column end up together on one worker node. Correct me if I'm wrong.
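
For concreteness, here is a minimal sketch of the two variants being compared. It assumes a string column named key and a placeholder myfunction that just counts the rows it receives. The column name, the placeholder body, and the object and method names are all made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
import org.apache.spark.sql.functions.col

object GroupVsRepartition {
  // Hypothetical stand-in for myfunction: counts the rows it is handed.
  def myfunction(rows: Iterator[Row]): Long = rows.size.toLong

  // Variant 1: groupByKey + mapGroups. The function is called once per distinct key.
  def perGroup(spark: SparkSession, df: DataFrame): Dataset[(String, Long)] = {
    import spark.implicits._ // encoders for String and (String, Long)
    df.groupByKey((row: Row) => row.getAs[String]("key"))
      .mapGroups((key: String, rows: Iterator[Row]) => (key, myfunction(rows)))
  }

  // Variant 2: repartition + mapPartitions. The function is called once per partition.
  def perPartition(spark: SparkSession, df: DataFrame): Dataset[Long] = {
    import spark.implicits._ // encoder for Long
    df.repartition(col("key"))
      .mapPartitions((rows: Iterator[Row]) => Iterator(myfunction(rows)))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]") // local mode, only for trying the sketch out
      .appName("group-vs-repartition")
      .getOrCreate()
    import spark.implicits._

    // Tiny made-up input; the real DataFrame would be large.
    val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")

    perGroup(spark, df).show()
    perPartition(spark, df).show()
    spark.stop()
  }
}
```

One difference that shows up in the sketch: in the second variant the function receives a whole partition, which may hold rows for several keys (or none at all), so it is not invoked once per key the way it is in the first variant.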