What is the difference between reduce and reduceByKey in Apache Spark in terms of their functionality? Why is reduceByKey a transformation while reduce is an action?
3 Answers
This is close to a duplicate of my answer explaining reduceByKey, but I will elaborate on the specific part that makes the two different. However, refer to that answer for a bit more detail on the internals of reduceByKey.
Basically, reduce must pull the entire dataset down to a single location because it is reducing to one final value. reduceByKey, on the other hand, produces one value per key. And since this operation can first be run locally on each machine, the result can remain an RDD and have further transformations applied to it.
Note, however, that there is also a reduceByKeyLocally you can use to automatically pull the resulting Map down to a single location.
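A minimal PySpark sketch of that distinction (the local SparkContext, the sample pairs, and the extra mapValues step are illustrative assumptions, not part of the answer):

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "reduce-vs-reduceByKey")
    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

    # reduce is an action: it returns a single value to the driver.
    total = pairs.map(lambda kv: kv[1]).reduce(lambda x, y: x + y)   # 6

    # reduceByKey is a transformation: it returns a new RDD, so further
    # transformations can be chained before anything is materialized.
    summed = pairs.reduceByKey(lambda x, y: x + y).mapValues(lambda v: v * 10)
    print(summed.collect())   # e.g. [('a', 40), ('b', 20)]

    # reduceByKeyLocally is an action: it returns a plain dict on the driver.
    local = pairs.reduceByKeyLocally(lambda x, y: x + y)   # {'a': 4, 'b': 2}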

Please go through this official documentation link.
reduce is an action which aggregates the elements of the dataset using a function func (which takes two arguments and returns one); we can also use reduce on single RDDs (for more info, please click HERE).
reduceByKey: when called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V (for more info, please click HERE).
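A small PySpark illustration of those two documented behaviours (the word-count data and the use of SparkContext.getOrCreate() are assumptions made for the example):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # A (K, V) dataset; reduceByKey's function must be of type (V, V) => V.
    words = sc.parallelize(["spark", "rdd", "spark"]).map(lambda w: (w, 1))

    counts = words.reduceByKey(lambda a, b: a + b)   # still an RDD of (K, V) pairs
    print(counts.collect())                          # e.g. [('spark', 2), ('rdd', 1)]

    # reduce folds the whole dataset down to one value on the driver.
    total = words.map(lambda kv: kv[1]).reduce(lambda a, b: a + b)
    print(total)                                     # 3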

This is from the Qt Assistant:
reduce(f): Reduces the elements of this RDD using the specified commutative and associative binary operator. Currently reduces partitions locally.
reduceByKey(func, numPartitions=None, partitionFunc=<function portable_hash>): Merge the values for each key using an associative and commutative reduce function.
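A short sketch exercising the quoted signatures (the sample data and the explicit numPartitions/partitionFunc arguments are assumptions for illustration; by default PySpark hashes keys with portable_hash):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])

    # reduce(f): f must be a commutative and associative binary operator.
    print(rdd.values().reduce(lambda x, y: x + y))   # 3

    # reduceByKey: numPartitions and partitionFunc control how the
    # resulting (K, V) RDD is partitioned.
    by_key = rdd.reduceByKey(lambda x, y: x + y,
                             numPartitions=2,
                             partitionFunc=lambda key: hash(key))
    print(by_key.getNumPartitions())                 # 2
    print(sorted(by_key.collect()))                  # [('a', 2), ('b', 1)]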