4

When I call groupBy on a DataFrame in Apache Spark and then aggregate another column of the same DataFrame, is there any performance issue? Would reduceByKey be a better option?

df.groupBy("primaryKey").agg(max("another column"))
Nsp

1 Answer

2

With groupBy, the reduce job executes sequentially, but with reduceByKey Spark internally runs multiple reduce jobs in parallel, because it knows the keys and runs the reduce per key. reduceByKey gives better performance than groupBy. You can run aggregations with both.

Sagar balai
    I think you are confusing the RDD function `groupByKey` with the DataFrame `groupBy`; they are quite different. The DataFrame `groupBy` aggregates locally first. – Shaido Mar 27 '18 at 05:49
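
To make "aggregate locally first" from the comment above concrete, here is a minimal plain-Python sketch. It is not Spark code and not how Spark is implemented; it only simulates the idea on two hypothetical partitions of `(key, value)` pairs. The function names `shuffle_then_reduce` and `combine_then_reduce` and the `records_shuffled` counter are made up for illustration. The first function moves every record before taking the max per key (the RDD `groupByKey` pattern); the second reduces each partition to one value per key before merging (the pattern used by RDD `reduceByKey` and by a DataFrame `groupBy(...).agg(max(...))`).

```python
from collections import defaultdict


def shuffle_then_reduce(partitions):
    """groupByKey-style: every record is moved, then max is taken per key."""
    shuffled = defaultdict(list)
    records_shuffled = 0
    for part in partitions:
        for key, value in part:
            shuffled[key].append(value)
            records_shuffled += 1
    return {k: max(vs) for k, vs in shuffled.items()}, records_shuffled


def combine_then_reduce(partitions):
    """Partial aggregation: each partition keeps one max per key before merging."""
    merged = {}
    records_shuffled = 0
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = max(local.get(key, value), value)
        # Only the per-partition partial results are moved and merged.
        for key, value in local.items():
            merged[key] = max(merged.get(key, value), value)
            records_shuffled += 1
    return merged, records_shuffled


# Hypothetical data: two partitions, two keys.
partitions = [
    [("a", 1), ("a", 5), ("b", 2)],
    [("a", 3), ("b", 7), ("b", 4)],
]

full_result, full_count = shuffle_then_reduce(partitions)
partial_result, partial_count = combine_then_reduce(partitions)

print(full_result, full_count)        # {'a': 5, 'b': 7} 6
print(partial_result, partial_count)  # {'a': 5, 'b': 7} 4
```

Both approaches give the same maxima, but the second moves 4 records instead of 6. On real data with many rows per key the gap is far larger, which is why the DataFrame `groupBy` followed by `agg` should not be expected to suffer the `groupByKey` penalty. Calling `explain()` on the aggregated DataFrame is one way to check the physical plan on your own cluster.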