The easiest way to learn Spark and reduceByKey is to read the official documentation of PairRDDFunctions, which says:
reduceByKey(func: (V, V) ⇒ V): RDD[(K, V)]
Merge the values for each key using an associative and commutative reduce function.
So it takes all the values for each key and merges them pairwise with the reduce function; with addition as the function, you end up with a single value per key that is the sum of all that key's values.
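For instance, here is a minimal sketch of that merging, assuming a SparkContext named sc is already in scope (the names and values are mine, purely for illustration):

    // Build a pair RDD with two values for the key "a".
    val pairs = sc.parallelize(Seq(("a", 1), ("a", 4), ("b", 2)))

    // The reduce function is applied pairwise to the values of each key,
    // which is why it has to be associative and commutative.
    pairs.reduceByKey(_ + _).collect()   // e.g. Array((a,5), (b,2))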
Now, you may be asking yourself:
What is the key?
The key to understanding the key (pun intended) is to see how keys are generated, and that's the role of the line map(words => (words(2),1)). This is where you take the words array and turn it into a pair of a key and the value 1.
This is a classic map-reduce algorithm where you give 1 to all keys so they can be reduced in the following step. In the end, after this map, you'll have a series of key-value pairs as follows:
(hello, 1)
(world, 1)
(nice, 1)
(to, 1)
(see, 1)
(you, 1)
(again, 1)
(again, 1)
I repeated the last (again, 1) pair on purpose to show you that pairs can occur multiple times.
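Feeding exactly these pairs into reduceByKey(_ + _) collapses the two (again, 1) pairs into a single (again, 2) while the other counts stay at 1. A small sketch, again assuming a SparkContext named sc:

    val wordPairs = sc.parallelize(Seq(
      ("hello", 1), ("world", 1), ("nice", 1), ("to", 1),
      ("see", 1), ("you", 1), ("again", 1), ("again", 1)))

    wordPairs.reduceByKey(_ + _).collect()
    // e.g. Array((hello,1), (world,1), (nice,1), (to,1), (see,1), (you,1), (again,2))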
The series is created using the RDD.map operator that takes a function which splits a single line and tokenizes it into words:
logs.map(line => line.split(' '))
It reads: for every line in logs, split the line into tokens using a single space as the separator.
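Putting the two map steps and reduceByKey together, here is a hedged end-to-end sketch; the logs RDD and its contents are made up here just to have something runnable (in the real code, logs would already exist):

    // Hypothetical input; in practice logs would come from e.g. sc.textFile(...).
    val logs = sc.parallelize(Seq(
      "2016-01-01 INFO hello",
      "2016-01-02 WARN hello",
      "2016-01-03 INFO world"))

    val counts = logs
      .map(line => line.split(' '))   // tokenize every line on a single space
      .map(words => (words(2), 1))    // key on the third token, pair it with 1
      .reduceByKey(_ + _)             // sum the 1s per key

    counts.collect()   // e.g. Array((hello,2), (world,1))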
I'd change this line to use a regex like \\s+ so any whitespace character would be treated as part of the separator.
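A minimal sketch of that change, reusing the hypothetical logs RDD from above:

    // Split on any run of whitespace (spaces, tabs, ...) instead of a single space.
    logs.map(line => line.split("\\s+"))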