This question is about the duality between DataFrame
and RDD
when it comes to aggregation operations. In Spark SQL one can use table generating UDFs for custom aggregations but creating one of those is typically noticeably less user-friendly than using the aggregation functions available for RDDs, especially if table output is not required.
Is there an efficient way to apply pair RDD operations such as aggregateByKey
to a DataFrame which has been grouped using GROUP BY or ordered using ORDERED BY?
Normally, one would need an explicit map
step to create key-value tuples, e.g., dataFrame.rdd.map(row => (row.getString(row.fieldIndex("category")), row).aggregateByKey(...)
. Can this be avoided?