
I'm trying to do some data analysis that involves aggregations using the pySpark DataFrame API. My understanding is that the groupBy() operation is equivalent to Spark's groupByKey() RDD command. Is there a command in the DataFrame API that is equivalent to Spark's reduceByKey()? My concern is that groupBy() seems to collect all values for a key into memory, which is not great in terms of performance.
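For reference, here's a minimal sketch of the two styles I'm comparing, on a toy DataFrame with `key` and `value` columns (the column names and the sum aggregation are just placeholders for my actual workload):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("agg-example").getOrCreate()

# Toy data: (key, value) pairs.
df = spark.createDataFrame(
    [("a", 1), ("b", 2), ("a", 3), ("b", 4)],
    ["key", "value"],
)

# RDD style: reduceByKey combines values per key map-side before the
# shuffle, so all values for a key never need to sit in memory at once.
rdd_sums = (df.rdd
            .map(lambda row: (row.key, row.value))
            .reduceByKey(lambda a, b: a + b))
print(rdd_sums.collect())

# DataFrame style: groupBy followed by an aggregate function. My question
# is whether this behaves like groupByKey (collect all values per key)
# or like reduceByKey (partial aggregation before the shuffle).
df_sums = df.groupBy("key").agg(F.sum("value").alias("total"))
df_sums.show()
```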

Thanks.

masta-g3
  • Possible duplicate of [DataFrame groupBy behaviour/optimization](http://stackoverflow.com/questions/32902982/dataframe-groupby-behaviour-optimization) –  Oct 13 '16 at 21:15
  • Possible duplicate of http://stackoverflow.com/questions/34249841/spark-dataframe-reducebykey-like-operation – Kristian Oct 14 '16 at 03:33
  • So this means `groupBy` works like `reduceByKey`? It is not entirely clear to me that this is always the expected case. – masta-g3 Oct 15 '16 at 15:54
  • https://stackoverflow.com/questions/39796493/spark-outofmemoryerror-when-taking-a-big-input-file#comment66900222_39796493 (the second link explains some cases). –  Oct 15 '16 at 18:32

0 Answers