
As the title describes: is there a combine function in Spark, like Hadoop's combiner, for reducing the amount of data transferred during the shuffle? Thanks in advance.

Jack
  • Have a look at reduceByKey / combineByKey, which reduce the values for each key within a partition before shuffling. – Knight71 Jun 22 '15 at 18:00
  • In addition, mapPartitions and other partition-based operations can help with that; e.g. if you are trying to find the max/min in a dataset, you can compute the max/min for every partition first (see the sketch after these comments). – Anant Jun 22 '15 at 18:16
  • https://stackoverflow.com/questions/43364432/spark-difference-between-reducebykey-vs-groupbykey-vs-aggregatebykey-vs-combineb – skjagini Oct 16 '19 at 19:21
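
To illustrate the per-partition idea from the comment above, here is a minimal sketch; the sample data, object name, and local Spark setup are made up for the example:

    import org.apache.spark.{SparkConf, SparkContext}

    object PartitionMax {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("partition-max").setMaster("local[*]"))

        val nums = sc.parallelize(1 to 1000000)

        // Reduce each partition to (at most) its local max first, so only
        // one value per partition crosses the network instead of the whole RDD.
        val globalMax = nums
          .mapPartitions(it => if (it.hasNext) Iterator(it.max) else Iterator.empty)
          .reduce((a, b) => math.max(a, b))

        println(globalMax)
        sc.stop()
      }
    }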

1 Answer


You want to use aggregateByKey; it takes a combOp argument that plays the same role as a Hadoop combiner. In most cleanly written code, reduceByKey will automatically apply its reduce function map-side as the combiner. A sketch of both follows.
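
A minimal sketch of both options, assuming a local SparkContext and made-up (key, value) data:

    import org.apache.spark.{SparkConf, SparkContext}

    object CombinerSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("combiner-sketch").setMaster("local[*]"))

        val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))

        // reduceByKey: the reduce function runs map-side within each partition
        // first (acting as the combiner), then again after the shuffle.
        val sums = pairs.reduceByKey(_ + _)

        // aggregateByKey: seqOp folds each value into the per-partition
        // accumulator (map-side, combiner-like); combOp merges accumulators
        // from different partitions after the shuffle.
        val sumAndCount = pairs.aggregateByKey((0, 0))(
          (acc, v) => (acc._1 + v, acc._2 + 1),  // seqOp: within a partition
          (a, b)   => (a._1 + b._1, a._2 + b._2) // combOp: across partitions
        )

        sums.collect().foreach(println)        // e.g. (a,4), (b,6)
        sumAndCount.collect().foreach(println) // e.g. (a,(4,2)), (b,(6,2))
        sc.stop()
      }
    }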

aaronman