As the title describes: is there a combine function in Spark, like Hadoop's combiner, for reducing the amount of data transferred during a shuffle? Thanks in advance.
- Have a look at reduceByKey / combineByKey, which reduce values by key within each partition before shuffling (see the sketch after these comments). – Knight71 Jun 22 '15 at 18:00
- In addition, mapPartitions and other partition-based operations can help with that, e.g. if you are trying to find the max/min in a dataset: you can compute the max/min for every partition first. – Anant Jun 22 '15 at 18:16
- https://stackoverflow.com/questions/43364432/spark-difference-between-reducebykey-vs-groupbykey-vs-aggregatebykey-vs-combineb – skjagini Oct 16 '19 at 19:21
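A minimal, self-contained sketch of both suggestions from the comments above; the sample data and app name are invented for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CombinerSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("CombinerSketch").setMaster("local[*]"))

    // Hypothetical sample data: (word, count) pairs spread across 4 partitions.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)), 4)

    // reduceByKey performs a map-side combine: each partition sums its own
    // values per key before anything crosses the network, much like a
    // Hadoop combiner.
    pairs.reduceByKey(_ + _).collect().foreach(println)

    // mapPartitions variant of the max/min idea: compute one local max per
    // partition (assumes no partition is empty, which holds here), then
    // reduce the handful of partial results.
    val nums = sc.parallelize(1 to 1000, 4)
    val globalMax = nums.mapPartitions(it => Iterator(it.max)).reduce(math.max)
    println(s"max = $globalMax")

    sc.stop()
  }
}
```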
1 Answer
You want to use aggregateByKey: it takes a combOp argument that serves the same purpose as a Hadoop combiner. In most cleanly written code, reduceByKey will automatically use its reduce function as the combiner.
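A minimal sketch of aggregateByKey; the dataset and the per-key averaging use case are invented for illustration. The seqOp runs inside each partition before the shuffle, and combOp merges the per-partition partial results afterwards, mirroring a Hadoop combiner's merge step:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AggregateByKeySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("AggregateByKeySketch").setMaster("local[*]"))

    // Hypothetical data: compute a per-key average via (sum, count) accumulators.
    val scores = sc.parallelize(Seq(("math", 90), ("math", 70), ("eng", 80)), 2)

    val sumCounts = scores.aggregateByKey((0, 0))(
      // seqOp: folds a value into the local accumulator; runs per partition,
      // before the shuffle, so only small partial results get shuffled.
      (acc, v) => (acc._1 + v, acc._2 + 1),
      // combOp: merges accumulators coming from different partitions.
      (a, b) => (a._1 + b._1, a._2 + b._2)
    )

    sumCounts.mapValues { case (sum, n) => sum.toDouble / n }
      .collect()
      .foreach(println)

    sc.stop()
  }
}
```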

aaronman