In Spark, is there an equivalent of the aggregateByKey method available for RDD in the DataFrame API?
For the dataframe API, use `groupBy`. – Shaido May 28 '19 at 02:02
1 Answer
Most common aggregation operations in the DataFrame interface can be done with `agg` and an already-defined aggregate function, e.g. `sum`, `first`, `max`, etc. If you're looking to do something like a GROUP BY and aggregation, à la SQL, you should look into those existing aggregation functions first.
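For illustration, here is a minimal, self-contained sketch of that built-in route; the `storeId`/`amount` columns and the sample data are invented for this example:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{max, sum}

object BuiltInAggExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("builtin-agg").getOrCreate()
    import spark.implicits._

    // Invented sample data: one row per sale.
    val df = Seq(("s1", 10.0), ("s1", 25.0), ("s2", 7.5)).toDF("storeId", "amount")

    // The DataFrame equivalent of SQL's GROUP BY + SUM/MAX.
    df.groupBy("storeId")
      .agg(sum("amount").as("total"), max("amount").as("largestSale"))
      .show()

    spark.stop()
  }
}
```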
The `aggregateByKey` method exposes more complex logic, however, which allows you to implement some sophisticated aggregation routines. If you're looking to do that, you'll want to use the Dataset interface, which is very similar to what you're already used to from RDDs. Specifically, look into creating a custom aggregator:
https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html
There, you define the aggregator methods like `initialize`, `merge`, etc. that specify how to create the aggregator, merge individual elements into the aggregate, and combine intermediate aggregates together across executors/tasks.
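As a concrete sketch: in the typed Dataset API, a custom aggregator extends `org.apache.spark.sql.expressions.Aggregator`, where the lifecycle methods are named `zero`, `reduce`, `merge`, and `finish` (rather than `initialize`/`update`/`merge`/`evaluate` as in the untyped UDAF at the link above). The `Sale` record type and the averaging logic below are assumptions made up for the example:

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical record type; the field names are invented for this example.
case class Sale(storeId: String, amount: Double)

// A typed Aggregator computing the average sale amount.
// Buffer = (running sum, running count).
object AverageSale extends Aggregator[Sale, (Double, Long), Double] {
  // Create the empty aggregation buffer.
  def zero: (Double, Long) = (0.0, 0L)

  // Fold one element into the buffer.
  def reduce(buf: (Double, Long), sale: Sale): (Double, Long) =
    (buf._1 + sale.amount, buf._2 + 1)

  // Combine two partial buffers (e.g. from different executors/tasks).
  def merge(a: (Double, Long), b: (Double, Long)): (Double, Long) =
    (a._1 + b._1, a._2 + b._2)

  // Turn the final buffer into the output value.
  def finish(buf: (Double, Long)): Double =
    if (buf._2 == 0) 0.0 else buf._1 / buf._2

  // Encoders tell Spark how to serialize the buffer and the result.
  def bufferEncoder: Encoder[(Double, Long)] =
    Encoders.tuple(Encoders.scalaDouble, Encoders.scalaLong)
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}
```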
Once your aggregator is defined, you can use it on a Dataset, e.g. `ds.groupByKey(_.myKey).agg(myCustomAggregator.toColumn)` (note `groupByKey`, not `groupBy`, on a typed Dataset, and `.toColumn` to adapt the aggregator into a `TypedColumn` that `agg` accepts).
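Putting it together with the hypothetical `AverageSale` aggregator sketched above:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("typed-agg").getOrCreate()
import spark.implicits._

val ds = Seq(Sale("s1", 10.0), Sale("s1", 25.0), Sale("s2", 7.5)).toDS()

// groupByKey yields a KeyValueGroupedDataset; toColumn adapts the
// Aggregator into a TypedColumn usable inside agg.
ds.groupByKey(_.storeId)
  .agg(AverageSale.toColumn.name("avgSale"))
  .show()
```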
