
I am doing some work with Scala and Spark (I'm a beginner programmer and poster). The goal is to map each request (line) to a pair (userid, 1) and then sum the hits.

Can anyone explain in more detail what is happening on the 1st and 3rd lines, and what the `=>` in `line => line.split` means?

Please excuse any errors in my post formatting as I am new to this website.

val userreqs = logs.map(line => line.split(' ')).
   map(words => (words(2),1)).
   reduceByKey((v1,v2) => v1 + v2)
rkk
  • I think `Can anyone explain in more detail what is happening on the 1st and 3rd line and what the => in: line => line.split means?` makes it different. – Jacek Laskowski May 25 '17 at 15:59

3 Answers


Consider the below hypothetical log:

trans_id amount  user_id
  1       100     A001
  2       200     A002
  3       300     A001
  4       200     A003

This is how the data is processed in Spark for each operation performed on the logs:

logs                             // RDD("1 100 A001", "2 200 A002", "3 300 A001", "4 200 A003")
.map(line => line.split(' '))    // RDD(Array(1,100,A001), Array(2,200,A002), Array(3,300,A001), Array(4,200,A003))
.map(words => (words(2),1))      // RDD((A001,1), (A002,1), (A001,1), (A003,1))
.reduceByKey((v1,v2) => v1+v2)   // RDD((A001,2), (A002,1), (A003,1))
  • line.split(' ') splits a string into an Array of String: "Hello World" => Array("Hello", "World")
  • reduceByKey(_+_) runs a reduce operation, grouping the data by key. In this case it adds up all the values for each key. In the above example there were two occurrences of the user key A001, and the value associated with each of those keys was 1, so they are reduced to the value 2 by the additive function (_ + _) passed to the reduceByKey method, as shown in the runnable sketch below.
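
If you want to try this end to end, here is a minimal, self-contained sketch. The log lines come from the hypothetical table above; the local SparkContext setup, app name, and object name are assumptions made for illustration, not part of the original question.

import org.apache.spark.{SparkConf, SparkContext}

object UserReqCount {
  def main(args: Array[String]): Unit = {
    // local Spark context just for experimenting (assumed setup)
    val sc = new SparkContext(new SparkConf().setAppName("userreqs-demo").setMaster("local[*]"))

    // the hypothetical log lines from the table above: "trans_id amount user_id"
    val logs = sc.parallelize(Seq("1 100 A001", "2 200 A002", "3 300 A001", "4 200 A003"))

    val userreqs = logs
      .map(line => line.split(' '))      // each line becomes Array(trans_id, amount, user_id)
      .map(words => (words(2), 1))       // keep only the user_id and pair it with a 1
      .reduceByKey((v1, v2) => v1 + v2)  // add up the 1s per user_id

    userreqs.collect().foreach(println)  // (A001,2), (A002,1), (A003,1) in some order

    sc.stop()
  }
}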
rogue-one

The easiest way to learn Spark and reduceByKey is to read the official documentation of PairRDDFunctions that says:

reduceByKey(func: (V, V) ⇒ V): RDD[(K, V)] Merge the values for each key using an associative and commutative reduce function.

So it basically takes all the values per key and merges them with that function, which in this example sums them into a single total per key.
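
If it helps to see that merge outside Spark, here is a plain-Scala analogue of the step on a small list of pairs; the example data is assumed, not taken from the documentation.

// plain-Scala analogue of reduceByKey((v1, v2) => v1 + v2): group the pairs by key, then sum the values
val pairs = List(("A001", 1), ("A002", 1), ("A001", 1), ("A003", 1))

val summed = pairs
  .groupBy { case (key, _) => key }                     // Map(A001 -> List((A001,1), (A001,1)), ...)
  .map { case (key, kvs) => key -> kvs.map(_._2).sum }  // sum the 1s per key

// summed: Map(A001 -> 2, A002 -> 1, A003 -> 1)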

Now, you may be asking yourself:

What is the key?

The key to understanding the key (pun intended) is to see how keys are generated, and that's the role of the line

map(words => (words(2),1)).

This is where you take words (the array of tokens from the previous step) and turn it into a pair of the key, words(2), and the value 1.

This is a classic map-reduce algorithm where you give 1 to all keys to reduce them in the following step.

In the end, after this map you'll have a series of key-value pairs as follows:

(hello, 1)
(world, 1)
(nice, 1)
(to, 1)
(see, 1)
(you, 1)
(again, 1)
(again, 1)

I repeated the last (again, 1) pair on purpose to show you that pairs can occur multiple times.
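
To see how reduceByKey then collapses those duplicates, here is a sketch that assumes a SparkContext named sc is already in scope (as it is in spark-shell):

// the word pairs from the list above, including the duplicated (again, 1)
val pairs = sc.parallelize(Seq(
  ("hello", 1), ("world", 1), ("nice", 1), ("to", 1),
  ("see", 1), ("you", 1), ("again", 1), ("again", 1)))

pairs.reduceByKey((v1, v2) => v1 + v2).collect()
// Array((hello,1), (world,1), (nice,1), (to,1), (see,1), (you,1), (again,2)), order may vary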

The series is created using the RDD.map operator, which takes a function that splits a single line and tokenizes it into words.

logs.map(line => line.split(' ')).

It reads:

For every line in logs, split the line into tokens using (space) as separator.

I'd change this line to use a regex like \\s+ so any run of whitespace characters would be treated as a separator.
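
For example (the sample strings below are made up for illustration):

// a single-space separator produces empty tokens when fields are separated by several spaces
"1  100  A001".split(' ')     // Array(1, "", 100, "", A001)
// a whitespace regex treats any run of whitespace (spaces, tabs) as one separator
"1  100\tA001".split("\\s+")  // Array(1, 100, A001)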

Jacek Laskowski

line.split(' ') splits each line on the space character, which returns an array of strings.

For example: "hello spark scala".split(' ') gives [hello, spark, scala]

`reduceByKey((v1,v2) => v1 + v2)`  is equivalent to `reduceByKey(_ + _)`

Here is how reduceByKey works: https://i.stack.imgur.com/igmG3.gif and http://backtobazics.com/big-data/spark/apache-spark-reducebykey-example/

For the same key it keeps adding all the values.
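
The underscore form is just Scala shorthand for a two-argument function, so you can see the same equivalence on a plain collection (the numbers are an assumed example):

// both forms pass the same two-argument addition function to reduce
List(1, 2, 3).reduce((v1, v2) => v1 + v2) // 6
List(1, 2, 3).reduce(_ + _)               // 6, the underscores stand for the two arguments in order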

Hope this helped!

koiralo