Difference between reduce and reduceByKey in Apache Spark

Question

What is the difference between reduce and reduceByKey in Apache Spark in terms of functionality? Why is reduceByKey a transformation while reduce is an action?

score 20 · Answer 1 · answered Dec 22 '17 at 02:46

This is close to a duplicate of my answer explaining reduceByKey, but I will elaborate on the specific part that makes the two different. Refer to that answer for more detail on the internals of reduceByKey.

Basically, reduce must bring its result down to a single location (the driver) because it reduces the whole dataset to one final value. reduceByKey, on the other hand, produces one value for each key. Since that reduction can be run locally on each machine first, the result can remain an RDD and have further transformations applied to it.

Note, however, that there is also a reduceByKeyLocally, which you can use to automatically pull the result down to a single location as a Map.
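To make the distinction above concrete, here is a minimal plain-Python model. It is not Spark code: an "RDD" is assumed to be just a list of partitions, and `model_reduce`, `model_reduce_by_key`, `model_reduce_by_key_locally`, and `combine_locally` are hypothetical helper names invented for this sketch. It only illustrates where the result ends up in each case.

```python
from functools import reduce
import operator

# Hypothetical stand-in for a pair RDD: a list of partitions,
# each partition being a list of (key, value) pairs.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("c", 5)],
    [("a", 6), ("c", 7)],
]

def model_reduce(parts, func):
    """Action-like: reduce each partition locally, then combine the
    partial results in one place and return a single plain value."""
    partials = [reduce(func, p) for p in parts if p]
    return reduce(func, partials)

def combine_locally(part, func):
    """Merge the values of one partition per key (local pre-aggregation)."""
    merged = {}
    for k, v in part:
        merged[k] = func(merged[k], v) if k in merged else v
    return merged

def model_reduce_by_key(parts, func, num_partitions=2):
    """Transformation-like: the result is again a list of partitions."""
    out = [dict() for _ in range(num_partitions)]
    for local in (combine_locally(p, func) for p in parts):
        for k, v in local.items():
            # Which partition a key lands in is arbitrary in this model.
            target = out[hash(k) % num_partitions]
            target[k] = func(target[k], v) if k in target else v
    return [sorted(d.items()) for d in out]

def model_reduce_by_key_locally(parts, func):
    """Like the model above, but merged into one dict in a single place."""
    result = {}
    for local in (combine_locally(p, func) for p in parts):
        for k, v in local.items():
            result[k] = func(result[k], v) if k in result else v
    return result

values_only = [[v for _, v in p] for p in partitions]
total = model_reduce(values_only, operator.add)                   # 28
by_key = model_reduce_by_key(partitions, operator.add)            # still partitioned
locally = model_reduce_by_key_locally(partitions, operator.add)   # {'a': 10, 'b': 6, 'c': 12}
```

In this model `total` is one number, `by_key` keeps the partitioned shape so more per-partition steps could follow, and `locally` is an ordinary dictionary held in one place.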

score 0 · Answer 2 · answered Dec 22 '17 at 05:53

Please go through the official documentation.

reduce is an action that aggregates the elements of the dataset using a function func (which takes two arguments and returns one). reduce can also be used on RDDs of single values, not only on key-value pairs.

reduceByKey, when called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V.

score 0 · Answer 3 · edited Mar 03 '20 at 14:45


This is from the Qt Assistant:

reduce(f): Reduces the elements of this RDD using the specified commutative and associative binary operator. Currently reduces partitions locally.

reduceByKey(func, numPartitions=None, partitionFunc=) : Merge the values for each key using an associative and commutative reduce function.
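Both descriptions require the operator to be commutative and associative. A small plain-Python sketch (again a model, not Spark code; `model_reduce` is a hypothetical helper) shows why: when partitions are reduced locally and the partial results are then combined, an operator such as subtraction gives an answer that depends on how the data happens to be partitioned, while addition does not.

```python
from functools import reduce
import operator

def model_reduce(parts, func):
    """Reduce each partition locally, then combine the partial results."""
    partials = [reduce(func, p) for p in parts if p]
    return reduce(func, partials)

# The same six numbers, split in two different ways.
two_parts = [[1, 2, 3], [4, 5, 6]]
three_parts = [[1, 2], [3, 4], [5, 6]]

# Addition is commutative and associative: the partitioning does not matter.
add_two = model_reduce(two_parts, operator.add)      # 21
add_three = model_reduce(three_parts, operator.add)  # 21

# Subtraction is neither: every partitioning gives a different answer.
sub_two = model_reduce(two_parts, operator.sub)      # (1-2-3) - (4-5-6) = 3
sub_three = model_reduce(three_parts, operator.sub)  # (1-2) - (3-4) - (5-6) = 1
sub_sequential = reduce(operator.sub, [1, 2, 3, 4, 5, 6])  # -19
```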

edited Mar 03 '20 at 14:45

Community


answered Nov 15 '18 at 14:46

张文迪

