When I groupBy a DataFrame in Apache Spark and then aggregate on another column, is there any performance issue? Would reduceBy be a better option?
df.groupBy("primaryKey").agg(max("another column"))
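In the RDD API, groupByKey shuffles every value for a key across the network before any reduction happens, whereas reduceByKey first combines the values for each key inside each partition and shuffles only those partial results. For that reason reduceByKey usually performs better than groupByKey, and you can run an aggregation with either. Note that DataFrame groupBy(...).agg(...) is not the same as RDD groupByKey: for built-in aggregates such as max, Spark SQL plans a partial aggregation on each partition before the shuffle, much like reduceByKey, so the query in the question should not suffer from the groupByKey problem.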
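To make the difference concrete, here is a small toy model in plain Python. It is not Spark code and not how Spark is implemented; the partition layout, the sample records, and the function names shuffle_all_then_reduce and combine_then_shuffle are all made up for illustration. It only counts how many records would have to cross the shuffle under each strategy when computing a max per key.

```python
from collections import defaultdict

# Toy model only: each inner list stands for one partition of (key, value) records.
partitions = [
    [("a", 3), ("b", 1), ("a", 7)],
    [("a", 2), ("b", 9), ("b", 4)],
]

def shuffle_all_then_reduce(parts):
    """groupByKey-style: every record crosses the shuffle, then max is taken per key."""
    shuffled = [record for part in parts for record in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    return {k: max(vs) for k, vs in grouped.items()}, len(shuffled)

def combine_then_shuffle(parts):
    """reduceByKey-style: max per key inside each partition, then shuffle only the partials."""
    shuffled = []
    for part in parts:
        partial = {}
        for key, value in part:
            partial[key] = max(partial.get(key, value), value)
        shuffled.extend(partial.items())
    # Final merge of the per-partition partial results.
    result = {}
    for key, value in shuffled:
        result[key] = max(result.get(key, value), value)
    return result, len(shuffled)

grouped_result, grouped_shuffled = shuffle_all_then_reduce(partitions)
combined_result, combined_shuffled = combine_then_shuffle(partitions)

print(grouped_result, grouped_shuffled)
print(combined_result, combined_shuffled)
```

Both strategies return the same maximum per key, but the second one moves fewer records, and the gap grows with the number of repeated keys per partition. In real Spark you can check which plan a DataFrame query gets by calling explain() on it and looking for an aggregate step before the exchange.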