
I'm trying to do some data analysis that involves aggregations using the pySpark DataFrame API. My understanding is that the groupBy() operation is equivalent to Spark's groupByKey() RDD command. Is there a command in the DataFrame API that is equivalent to Spark's reduceByKey()? My concern is that groupBy() seems to collect all values for a key into memory, which is not great in terms of performance.
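For reference, here's a minimal sketch of the two styles I'm comparing, on a toy DataFrame with `key` and `value` columns (the column names and the sum aggregation are just placeholders for my actual workload):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("agg-example").getOrCreate()

# Toy data: (key, value) pairs.
df = spark.createDataFrame(
    [("a", 1), ("b", 2), ("a", 3), ("b", 4)],
    ["key", "value"],
)

# RDD style: reduceByKey combines values per key map-side before the
# shuffle, so all values for a key never need to sit in memory at once.
rdd_sums = (df.rdd
            .map(lambda row: (row.key, row.value))
            .reduceByKey(lambda a, b: a + b))
print(rdd_sums.collect())

# DataFrame style: groupBy followed by an aggregate function. My question
# is whether this behaves like groupByKey (collect all values per key)
# or like reduceByKey (partial aggregation before the shuffle).
df_sums = df.groupBy("key").agg(F.sum("value").alias("total"))
df_sums.show()
```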

Thanks.

masta-g3
  • Possible duplicate of [DataFrame groupBy behaviour/optimization](http://stackoverflow.com/questions/32902982/dataframe-groupby-behaviour-optimization) –  Oct 13 '16 at 21:15
  • Possible duplicate of http://stackoverflow.com/questions/34249841/spark-dataframe-reducebykey-like-operation – Kristian Oct 14 '16 at 03:33
  • So this means `groupBy` works like `reduceByKey`? It is not entirely clear to me that this is always the expected case. – masta-g3 Oct 15 '16 at 15:54
  • https://stackoverflow.com/questions/39796493/spark-outofmemoryerror-when-taking-a-big-input-file#comment66900222_39796493 (the second link explains some cases). –  Oct 15 '16 at 18:32

0 Answers