
In Spark, is there an equivalent of the RDD aggregateByKey method in the DataFrame API?

RDD aggregateByKey API doc

Benjamin

1 Answer


Most common aggregation operations in the DataFrame API can be done with agg and an already-defined aggregate function, e.g. sum, first, max, etc. If you're looking to do something like a SQL-style GROUP BY with aggregation, you should look into those existing aggregation functions first.
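For the common case, that looks something like the following sketch, where df is a hypothetical DataFrame with "key" and "value" columns:

import org.apache.spark.sql.functions.{first, max, sum}

// Built-in aggregate functions cover most SQL-style GROUP BY needs
df.groupBy("key").agg(sum("value"), max("value"), first("value"))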

The aggregateByKey method exposes more complex logic, however, which lets you implement sophisticated aggregation routines. If that's what you need, use the Dataset API, which is very similar to what you're already used to from RDDs. Specifically, look into creating a custom aggregator:

https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html

There, you define the aggregator methods (e.g. zero/initialize, reduce/update, and merge) that specify how to create the initial aggregate, fold individual elements into it, and combine intermediate aggregates together across executors/tasks.
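As a rough illustration, here is a minimal typed Aggregator sketch in Scala; the Record case class and its fields are hypothetical stand-ins for your own data type:

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical input record
case class Record(myKey: String, value: Long)

// Sums Record.value per key.
// zero: the initial aggregate; reduce: fold one element into the aggregate;
// merge: combine partial aggregates from different tasks; finish: produce the result.
object MySumAggregator extends Aggregator[Record, Long, Long] {
  def zero: Long = 0L
  def reduce(buffer: Long, rec: Record): Long = buffer + rec.value
  def merge(b1: Long, b2: Long): Long = b1 + b2
  def finish(reduction: Long): Long = reduction
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}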

Once your aggregator is defined, you can use it on a Dataset, e.g.

ds.groupByKey(_.myKey).agg(myCustomAggregator.toColumn)
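For instance, with the sketch above (assuming a SparkSession in scope as spark):

import spark.implicits._

val ds = Seq(Record("a", 1L), Record("a", 2L), Record("b", 3L)).toDS()
ds.groupByKey(_.myKey).agg(MySumAggregator.toColumn).show()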

dchristle