groupByKey vs repartition performance

Question

When I do

df.groupByKey(row => row.getAs[String]("<column>")).mapGroups((key, rows) => myfunction(rows))

vs

df.repartition(df("<column>")).mapPartitions(...)

I would like to know which is more efficient when applied to a large DataFrame. As far as I know, both result in a shuffle, but repartition makes sure that all rows sharing a value of the partitioning column end up together on one worker node. Correct me if I'm wrong.

Without any further assumptions about the function you apply to groups / partitions, you can assume that both are more or less as inefficient as it gets in Spark: (1) both shuffle first, then apply the function; (2) both are black boxes for the optimizer, so it is not possible to push any operation-specific optimizations into the shuffle internals. The only substantial difference is that the former requires `DeserializeToObject` before the shuffle, while the latter doesn't. — zero323, Aug 31 '17 at 19:33
Also, partitioning is 0..*: a single partition can store multiple "keys". — zero323, Aug 31 '17 at 19:34
@zero323 The reason I'm planning to use the above code is to solve the issue described here: https://stackoverflow.com/questions/45950187/select-entire-row-based-on-a-logic-applied-on-2-columns-across-multiple-rows — shiv455, Aug 31 '17 at 19:55
This one is essentially a dupe of https://stackoverflow.com/q/33878370/1560062 with added `when(...).otherwise(...)` at the end - right? — zero323, Aug 31 '17 at 21:19
@zero323 When you say "this one", are you talking about https://stackoverflow.com/questions/45950187/select-entire-row-based-on-a-logic-applied-on-2-columns-across-multiple-rows? If yes, I didn't get your response; can you please elaborate? — shiv455, Aug 31 '17 at 23:49
I mean https://stackoverflow.com/questions/45950187/select-entire-row-based-on-a-logic-applied-on-2-columns-across-multiple-rows is the same problem as https://stackoverflow.com/q/33878370/1560062. Find the record with the latest date using one of the methods above, then apply `when(it_is_in_the_last_10_months, income).otherwise(sourced_income)` or whatever the condition is. — zero323, Sep 01 '17 at 08:54
@zero323 If you could post this as an answer with more details, it would be helpful. — shiv455, Sep 01 '17 at 14:27
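
To illustrate zero323's point that a single partition can store multiple "keys", here is a minimal sketch in plain Scala collections, with no Spark involved. Everything in it is hypothetical and for illustration only: `partitionOf`, `repartitionByKey` and `sumPerKey` are made-up helpers, and the `hashCode`-modulo placement is a simplified stand-in, not the hash Spark actually uses for DataFrame repartitioning. It suggests why a `mapPartitions` body that needs per-key results still has to group rows inside each partition, whereas `mapGroups` is handed one key at a time.

```scala
object PartitionSketch {
  // Simplified stand-in for hash partitioning (assumption: not Spark's real hash):
  // map a key to a partition index with a non-negative modulo of its hashCode.
  def partitionOf(key: String, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Hypothetical analogue of repartition(column): bucket rows by the hash of their key.
  // All rows with the same key share a bucket, but a bucket may hold several keys.
  def repartitionByKey[V](rows: Seq[(String, V)], numPartitions: Int): Map[Int, Seq[(String, V)]] =
    rows.groupBy { case (key, _) => partitionOf(key, numPartitions) }

  // Hypothetical mapPartitions-style body: because the partition can mix keys,
  // it must group by key itself before computing a per-key result.
  def sumPerKey(partition: Seq[(String, Int)]): Map[String, Int] =
    partition.groupBy(_._1).map { case (key, rows) => key -> rows.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    val rows = Seq("a" -> 1, "b" -> 2, "c" -> 3, "a" -> 4, "d" -> 5, "b" -> 6)
    val partitions = repartitionByKey(rows, numPartitions = 2)
    partitions.toSeq.sortBy(_._1).foreach { case (index, part) =>
      val keys = part.map(_._1).distinct.sorted.mkString(",")
      println(s"partition $index holds keys: $keys")
    }
  }
}
```

With four distinct keys and two partitions, at least one partition necessarily holds more than one key, so treating "one partition" as "one group" would silently merge unrelated rows.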

0 Answers