What is the difference between reduce and reduceByKey in Apache Spark in terms of functionality? Why is reduceByKey a transformation while reduce is an action?

J. P
user1326784

3 Answers

This is close to a duplicate of my answer explaining reduceByKey, but I will elaborate on the specific part that makes the two different. Refer to that answer for more detail on the internals of reduceByKey.

Basically, reduce must bring its result down to a single location (the driver) because it reduces the dataset to one final value. reduceByKey, on the other hand, produces one value for each key. Since that operation can be run locally on each machine first, the result can remain an RDD and have further transformations applied to it.

Note, however, that there is also a reduceByKeyLocally, which you can use to automatically pull the resulting Map down to a single location.
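The distinction drawn in this answer can be sketched in plain Python, without Spark. This is only an illustration under stated assumptions: the list of lists stands in for an RDD's partitions, and `simulated_reduce` and `simulated_reduce_by_key` are hypothetical helpers, not Spark APIs. In real Spark the per-key merge happens in a distributed shuffle and the result stays an RDD; here a dict plays that role.

```python
from functools import reduce
from operator import add

# Hypothetical stand-in for an RDD: a list of partitions,
# each partition being a list of (key, value) pairs.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("c", 5)],
]


def simulated_reduce(parts, func):
    """Action-like: reduce each partition locally, then merge the
    partial results into one final value in a single place."""
    partials = [reduce(func, part) for part in parts if part]
    return reduce(func, partials)


def simulated_reduce_by_key(parts, func):
    """Transformation-like: reduce per key inside each partition first,
    then merge the per-key partials. The result is still keyed data."""
    merged = {}
    for part in parts:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        for key, value in local.items():
            merged[key] = func(merged[key], value) if key in merged else value
    return merged


# reduce works on plain values, so drop the keys first.
values_only = [[value for _, value in part] for part in partitions]
total = simulated_reduce(values_only, add)

# reduceByKey keeps one value per key.
per_key = simulated_reduce_by_key(partitions, add)
```

Here `total` is a single number, while `per_key` still holds one entry per key and could be processed further, which mirrors why one is an action and the other a transformation.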

Justin Pihony

Please go through the official documentation.

reduce is an action that aggregates the elements of the dataset using a function func (which takes two arguments and returns one). reduce can also be used on plain RDDs that are not made of key-value pairs (for more info, see the documentation).

reduceByKey, when called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V (for more info, see the documentation).

Rajnish Kumar

This is what the documentation shown in Qt Assistant says:

reduce(f): Reduces the elements of this RDD using the specified commutative and associative binary operator. Currently reduces partitions locally.

reduceByKey(func, numPartitions=None, partitionFunc=) : Merge the values for each key using an associative and commutative reduce function.
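Both descriptions require a commutative and associative function. A minimal plain-Python sketch of why, assuming partitions are modeled as lists and using a hypothetical helper `reduce_partitioned` (not a Spark API): when each partition is reduced locally and the partial results are then merged, an operator such as subtraction gives an answer that depends on how the data happens to be split.

```python
from functools import reduce
from operator import add, sub


def reduce_partitioned(parts, func):
    """Reduce each partition locally, then reduce the partial results."""
    partials = [reduce(func, part) for part in parts if part]
    return reduce(func, partials)


# The same four elements, split two different ways.
one_partition = [[1, 2, 3, 4]]
two_partitions = [[1, 2], [3, 4]]

# Addition is commutative and associative: partitioning does not matter.
add_one = reduce_partitioned(one_partition, add)
add_two = reduce_partitioned(two_partitions, add)

# Subtraction is neither: the result changes with the partitioning.
sub_one = reduce_partitioned(one_partition, sub)   # ((1 - 2) - 3) - 4
sub_two = reduce_partitioned(two_partitions, sub)  # (1 - 2) - (3 - 4)
```

With addition both layouts give 10, while subtraction gives -8 for one partition and 0 for two, so the outcome would not be well defined on a cluster.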

Community
张文迪