Is GroupByKey function in Spark that bad?

Question

groupByKey vs reduceByKey: is groupByKey really that bad? If it is, and it produces the same output as reduceByKey, why did Spark create this function? There must be some use case where groupByKey, despite consuming more network bandwidth and causing more shuffling, is still more useful than reduceByKey or aggregateByKey. If it is not useful at all, shouldn't it be removed from Spark in an upcoming release?

Did the answer help you? – thebluephantom, Jan 03 '19 at 11:57

thebluephantom · Answer 1 · 2019-01-03T10:15:27.230

Yes, it is. See this excellent link: https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html

reduceByKey minimizes shuffling by combining the values for each key within a partition before the shuffle, much like a Hadoop combiner does.
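To make the combiner comparison concrete, here is a minimal plain-Python sketch (not Spark itself) that simulates two partitions and counts how many records would cross the network under each strategy. The partition contents and the helper names `shuffle_group_by_key`, `shuffle_reduce_by_key` and `final_sums` are made up for illustration; in real Spark the corresponding calls are `rdd.groupByKey()` and `rdd.reduceByKey(func)`.

```python
from collections import defaultdict

# Hypothetical input: two partitions of (word, 1) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("a", 1), ("b", 1)],
]

def shuffle_group_by_key(parts):
    """groupByKey-style: every record is sent over the network unchanged."""
    return [record for part in parts for record in part]

def shuffle_reduce_by_key(parts, func):
    """reduceByKey-style: combine per key inside each partition first,
    so at most one record per key leaves each partition."""
    shuffled = []
    for part in parts:
        combined = {}
        for key, value in part:
            combined[key] = func(combined[key], value) if key in combined else value
        shuffled.extend(combined.items())
    return shuffled

def final_sums(shuffled):
    """Reduce side: merge whatever arrived after the shuffle."""
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals)

grouped = shuffle_group_by_key(partitions)
reduced = shuffle_reduce_by_key(partitions, lambda x, y: x + y)

print(len(grouped), final_sums(grouped))  # records shuffled, totals
print(len(reduced), final_sums(reduced))
```

Both paths end with the same totals, but in this toy input the pre-combined path shuffles 4 records instead of 7. The gap widens as the number of repeated keys per partition grows.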

That said, groupByKey is necessary at times, but try to apply it to data that has already been massaged. As the link shows, groupByKey needs extra logic when it is used for summing, which makes it the wrong tool for that job rather than a useless one. So for RDDs, groupByKey should not be removed from Spark.
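As one illustration of when grouping is genuinely needed (an assumed example, not taken from the linked page): an exact median per key needs every value for that key at once, so it cannot be expressed as a pairwise reduce function. The plain-Python sketch below mimics that with a hypothetical `group_by_key` helper and invented sample data; in Spark the equivalent shape would be `rdd.groupByKey().mapValues(...)`.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (key, value) records.
records = [("a", 5), ("b", 2), ("a", 1), ("a", 9), ("b", 4)]

def group_by_key(pairs):
    """Collect all values for each key, as groupByKey does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

# The median needs the complete list of values per key,
# so there is nothing to pre-combine before grouping.
medians = {key: median(values) for key, values in group_by_key(records).items()}
print(medians)
```

A sum or a count, by contrast, can be merged two values at a time, which is why reduceByKey fits those cases better.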

For DataFrames there is groupBy, which benefits from Catalyst optimizations under the hood.
