4

When I do a

df.groupByKey("<column>").mapGroups((key,value) => myfunction(value))

vs

df.repartition("<column>").mapPartitions(...)

Would like to know which is more efficient when applied on a large DataFrames? What I know is both results in shuffle but repartition will make sure that data related to partitioned column will always be together on one worker node. Correct me if I'm wrong.

zero323
  • 322,348
  • 103
  • 959
  • 935
shiv455
  • 7,384
  • 19
  • 54
  • 93
  • 1
    Without any further assumptions about the function you apply on groups / partitions you can assume that both are more or less as inefficient as it gets in Spark: - Both functions shuffle first, then apply the function. - Both are black boxes for the optimizer therefore it is not possible to push any operation specific optimizations into the shuffle internals. The only substantial difference is that the former one requires `DeserializeToObject` before shuffle, while the latter one doesn't. – zero323 Aug 31 '17 at 19:33
  • 1
    Also partitioning is 0..* - single partition can store multiple "keys". – zero323 Aug 31 '17 at 19:34
  • @zero323 the reason why im planning to use above code is to solve issue mentioned below . https://stackoverflow.com/questions/45950187/select-entire-row-based-on-a-logic-applied-on-2-columns-across-multiple-rows – shiv455 Aug 31 '17 at 19:55
  • This one is essentially a dupe of https://stackoverflow.com/q/33878370/1560062 with added `when(...).otherwise(...)` at the end - right? – zero323 Aug 31 '17 at 21:19
  • @zero323 when you say "this one" are you talking about https://stackoverflow.com/questions/45950187/select-entire-row-based-on-a-logic-applied-on-2-columns-across-multiple-rows ?if yes I didn't get response can you please elaborate – shiv455 Aug 31 '17 at 23:49
  • I mean https://stackoverflow.com/questions/45950187/select-entire-row-based-on-a-logic-applied-on-2-columns-across-multiple-rows is the same problem as https://stackoverflow.com/questions/45950187/select-entire-row-based-on-a-logic-applied-on-2-columns-across-multiple-rows. Find the record with the latest date using one of the methods above and `when(it_is_in_the_last_10_months, income).otherwise(sourced_income)` or whatever the condition is. – zero323 Sep 01 '17 at 08:54
  • @zero323 if you can post as answer with more details it would be helpful – shiv455 Sep 01 '17 at 14:27

0 Answers0