
How to sum the keys and the values separately across the key-value pairs generated in Spark Scala?

Given the following input

(5,1),(6,1),(8,1)

I'd like to get to the following output

(19,3)

This is what I've tried so far:

val spark = SparkSession.builder.appName("myapp").getOrCreate()   
val data = spark.read.textFile(args(0)).rdd  
val result =
  data.map { line => {  
    val tokens = line.split("\t")  
    (tokens(4).toFloat, 1)  
  }}.
  reduceByKey( _+ _)
stefanobaghino
Chaitanya

3 Answers


You can use reduce or fold to get the result. You also need to convert the tokens(4) value to Int, or whichever numeric type you need.

val result = data.map{line => {  
  val tokens = line.split("\t")  
  (tokens(4).toInt,1)  
}} 

Using fold

result.fold((0,0)) { (acc, x) => (acc._1 + x._1, acc._2 + x._2)}

Using reduce

result.reduce((x,y) => (x._1 + y._1, x._2 + y._2)) 
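To sanity-check the logic without a Spark cluster, the same fold and reduce can be run on a plain Scala collection standing in for the RDD, using the sample pairs from the question:

```scala
// Plain-Scala sketch of the same aggregation; the Seq stands in for the RDD.
val pairs = Seq((5, 1), (6, 1), (8, 1))

// fold: start from a (0, 0) accumulator and sum each component.
val folded = pairs.fold((0, 0)) { (acc, x) => (acc._1 + x._1, acc._2 + x._2) }

// reduce: combine pairs pairwise; same result since the op is associative.
val reduced = pairs.reduce((x, y) => (x._1 + y._1, x._2 + y._2))

println(folded)  // (19,3)
println(reduced) // (19,3)
```

Both produce the expected `(19,3)`: 5 + 6 + 8 = 19 in the first slot, 1 + 1 + 1 = 3 in the second.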

Hope this helps!

koiralo

reduceByKey won't serve your purpose here. Please use foldLeft.

Refer to Scala: How to sum a list of tuples for a solution to your problem.
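For reference, a foldLeft over a plain list of tuples looks like this (a sketch on local data; note that RDDs expose fold and aggregate rather than foldLeft, so you would collect or use those on an actual RDD):

```scala
// foldLeft on an ordinary List of pairs; sums keys and values separately.
val pairs = List((5, 1), (6, 1), (8, 1))
val total = pairs.foldLeft((0, 0)) { (acc, x) =>
  (acc._1 + x._1, acc._2 + x._2)
}
println(total) // (19,3)
```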

Vinod Chandak
val spark = SparkSession.builder.appName("myapp").getOrCreate()   
val data = spark.read.textFile(args(0)).rdd  
val result = data.map{line => {  
  val tokens = line.split("\t")  
  (tokens(4).toInt,1)  
}}  
.reduce((l, r) => (l._1+r._1, l._2+r._2))

It's possible that a foldLeft (as suggested by Vinod Chandak) is more appropriate, but I tend to use reduce as I have more experience with it.
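For context on that choice: on non-empty local collections the two agree, but reduce has no zero element and throws on an empty collection, while foldLeft simply returns its zero (plain-Scala sketch, not Spark-specific):

```scala
val pairs = Seq((5, 1), (6, 1), (8, 1))
val empty = Seq.empty[(Int, Int)]

// foldLeft carries an explicit (0, 0) zero, so it is safe on empty input.
println(pairs.foldLeft((0, 0))((acc, x) => (acc._1 + x._1, acc._2 + x._2))) // (19,3)
println(empty.foldLeft((0, 0))((acc, x) => (acc._1 + x._1, acc._2 + x._2))) // (0,0)

// reduce has no zero and throws on an empty collection.
try empty.reduce((l, r) => (l._1 + r._1, l._2 + r._2))
catch { case _: UnsupportedOperationException => println("reduce throws on empty input") }
```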

Travis Hegner