I'm trying to do some data analysis that involves aggregations using the PySpark DataFrame API. My understanding is that the DataFrame groupBy() operation is equivalent to Spark's RDD groupByKey(). Is there a command in the DataFrame API that is equivalent to the RDD API's reduceByKey()? My concern is that groupBy() seems to collect all values for a key into memory, which is not great in terms of performance.
Thanks.