
As the title describes: is there a combine function in Spark, like Hadoop's combiner, for reducing the amount of data transferred during the shuffle? Thanks in advance.

Jack
  • Have a look at reduceByKey / combineByKey, which reduce the values for each key within a partition before shuffling. – Knight71 Jun 22 '15 at 18:00
  • In addition, mapPartitions and other partition-based operations can help with that; e.g. if you are trying to find the max/min in a dataset, you can compute the max/min for every partition first (see the sketch after these comments). – Anant Jun 22 '15 at 18:16
  • https://stackoverflow.com/questions/43364432/spark-difference-between-reducebykey-vs-groupbykey-vs-aggregatebykey-vs-combineb – skjagini Oct 16 '19 at 19:21
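
To illustrate the per-partition idea from the comment above, here is a minimal sketch; the sample data, object name, and local Spark setup are made up for the example:

    import org.apache.spark.{SparkConf, SparkContext}

    object PartitionMax {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("partition-max").setMaster("local[*]"))

        val nums = sc.parallelize(1 to 1000000)

        // Reduce each partition to (at most) its local max first, so only
        // one value per partition crosses the network instead of the whole RDD.
        val globalMax = nums
          .mapPartitions(it => if (it.hasNext) Iterator(it.max) else Iterator.empty)
          .reduce((a, b) => math.max(a, b))

        println(globalMax)
        sc.stop()
      }
    }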

1 Answer


You want to use aggregateByKey; it takes a combOp argument that plays the same role as a Hadoop combiner. In most cleanly written code, reduceByKey will automatically apply its reduce function map-side as the combiner. A sketch of both follows.
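
A minimal sketch of both options, assuming a local SparkContext and made-up (key, value) data:

    import org.apache.spark.{SparkConf, SparkContext}

    object CombinerSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("combiner-sketch").setMaster("local[*]"))

        val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))

        // reduceByKey: the reduce function runs map-side within each partition
        // first (acting as the combiner), then again after the shuffle.
        val sums = pairs.reduceByKey(_ + _)

        // aggregateByKey: seqOp folds each value into the per-partition
        // accumulator (map-side, combiner-like); combOp merges accumulators
        // from different partitions after the shuffle.
        val sumAndCount = pairs.aggregateByKey((0, 0))(
          (acc, v) => (acc._1 + v, acc._2 + 1),  // seqOp: within a partition
          (a, b)   => (a._1 + b._1, a._2 + b._2) // combOp: across partitions
        )

        sums.collect().foreach(println)        // e.g. (a,4), (b,6)
        sumAndCount.collect().foreach(println) // e.g. (a,(4,2)), (b,(6,2))
        sc.stop()
      }
    }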

aaronman