Should we use groupBy on a DataFrame or reduceBy?

Question

When I call groupBy on a DataFrame in Apache Spark and then aggregate another column of the same DataFrame, is there any performance issue? Would reduceBy be a better option?

df.groupBy("primaryKey").agg(max("another column"))

No, this question is clearly not a duplicate of "DataFrame / Dataset groupBy behaviour/optimization", and it deserves to be reopened. – Marc Le Bihan, Mar 17 '21 at 18:46

score 2 · Accepted Answer · answered Mar 27 '18 at 05:41


With groupBy, the reduce job executes sequentially, but with reduceByKey Spark internally runs multiple reduce jobs in parallel, because it knows the keys and runs the reduce per key. reduceByKey gives better performance than groupBy. You can run aggregations with both.


Sagar balai



I think you are confusing the RDD function `groupByKey` with the DataFrame `groupBy`; they are quite different. The DataFrame `groupBy` will aggregate locally first. – Shaido Mar 27 '18 at 05:49
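To make "aggregate locally first" concrete, here is a minimal sketch in plain Python rather than Spark. It only models the idea: it is not how Spark implements either operation. The partition contents and the function names `shuffle_everything` and `combine_then_shuffle` are made up for illustration. The first function sends every row across the shuffle, as the RDD `groupByKey` does. The second takes a per-partition max before the shuffle, which is roughly what `reduceByKey` and the DataFrame `groupBy(...).agg(max(...))` do.

```python
from collections import defaultdict

# Hypothetical input: rows already split across three partitions,
# each row being a (primaryKey, value) pair.
partitions = [
    [("a", 3), ("b", 7), ("a", 9)],
    [("a", 1), ("b", 2), ("b", 8)],
    [("a", 5), ("a", 4), ("b", 6)],
]


def shuffle_everything(parts):
    """Model of RDD groupByKey: every row crosses the shuffle, then max per key."""
    shuffled = [row for part in parts for row in part]
    groups = defaultdict(list)
    for key, value in shuffled:
        groups[key].append(value)
    result = {key: max(values) for key, values in groups.items()}
    return result, len(shuffled)


def combine_then_shuffle(parts):
    """Model of partial aggregation: max per key inside each partition first."""
    shuffled = []
    for part in parts:
        local = {}
        for key, value in part:
            local[key] = max(local.get(key, value), value)
        # Only one row per key per partition is sent to the shuffle.
        shuffled.extend(local.items())
    result = {}
    for key, value in shuffled:
        result[key] = max(result.get(key, value), value)
    return result, len(shuffled)


naive_result, naive_rows = shuffle_everything(partitions)
combined_result, combined_rows = combine_then_shuffle(partitions)

print("rows shuffled without local aggregation:", naive_rows)
print("rows shuffled with local aggregation:", combined_rows)
print("same answer:", naive_result == combined_result)
```

Both functions return the same max per key, but on this toy input the second moves 6 rows through the shuffle instead of 9. The gap grows with the number of rows per key in each partition.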
