I am learning Apache Spark and Scala, so I would appreciate some help. I query Cassandra for three columns (c1, c2, and c3) and load the result into a DataFrame in my Scala code. I need to bin c1 (bin size = 3, as in a histogram) and compute the mean of c2 and c3 within each c1 bin. Are there any built-in functions I can use for this instead of traditional for loops and if conditions?

B1K
I believe this will be helpful: http://stackoverflow.com/questions/29930110/how-to-more-efficiently-calculate-the-averages-for-each-key-in-a-pairwise-k-v – evgenii Apr 15 '16 at 12:18

1 Answer

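Try this:

// Key each row by c1 and carry (c2, c3, 1). This groups on the raw c1 value;
// to bin with size 3, key on math.floor(c1 / 3.0).toInt instead.
val modifiedRDD = rdd.map { case (c1, c2, c3) => (c1, (c2, c3, 1)) }
// Sum c2, c3, and the row count for each key.
val reducedRDD = modifiedRDD.reduceByKey { case (x, y) => (x._1 + y._1, x._2 + y._2, x._3 + y._3) }

// Divide the sums by the count; toDouble avoids integer division if c2 and c3 are integral.
val finalRDD = reducedRDD.map { case (c1, (totalC2, totalC3, count)) => (c1, totalC2.toDouble / count, totalC3.toDouble / count) }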
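
Since the question starts from a DataFrame, the same binning and averaging can probably be done without converting to an RDD. The following is only a sketch under stated assumptions: the columns are numeric and named c1, c2, and c3, bins are fixed-width and start at 0, and `BinnedMeans`, `binnedMeans`, the output column names, and the sample rows are all hypothetical names and data made up for illustration. It relies on the built-in `floor`, `col`, and `avg` functions from `org.apache.spark.sql.functions`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, col, floor}

object BinnedMeans {
  // Hypothetical helper: assigns each row to bin floor(c1 / binSize),
  // then averages c2 and c3 within each bin.
  def binnedMeans(df: DataFrame, binSize: Double = 3.0): DataFrame =
    df.groupBy(floor(col("c1") / binSize).as("bin"))
      .agg(avg("c2").as("mean_c2"), avg("c3").as("mean_c3"))
      .orderBy("bin")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("binned-means")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows standing in for the Cassandra query result.
    val df = Seq((0.5, 10.0, 1.0), (2.9, 20.0, 3.0), (3.0, 5.0, 7.0), (7.2, 8.0, 9.0))
      .toDF("c1", "c2", "c3")

    binnedMeans(df).show()
    spark.stop()
  }
}
```

Here bin k covers c1 values in the range [3k, 3k + 3). If the bins should start at the minimum of c1 rather than at 0, subtract that offset from c1 before dividing.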
Abhishek Anand