Using Spark: binning column1 and finding the mean of column2 within column1's bins

Question

I am learning Apache Spark and Scala, so I would appreciate some help. I query Cassandra for three columns (c1, c2 and c3) and load the result into a DataFrame in my Scala code. I need to bin c1 (bin size = 3, as in a histogram) and find the mean of c2 and c3 within each c1 bin. Are there any pre-built functions I can use for this, instead of traditional for loops and if conditions?

I believe this may be helpful: http://stackoverflow.com/questions/29930110/how-to-more-efficiently-calculate-the-averages-for-each-key-in-a-pairwise-k-v - evgenii, Apr 15 '16 at 12:18

score 0 · Answer 1 · answered Apr 14 '16 at 07:38

Try this

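// Assumes rdd: RDD[(Double, Double, Double)] holding (c1, c2, c3); the bin width is 3.
val binSize = 3.0
// Key each row by its c1 bin index and carry (c2, c3, count).
val modifiedRDD = rdd.map { case (c1, c2, c3) => (math.floor(c1 / binSize).toLong, (c2, c3, 1L)) }
// Sum c2, c3 and the row count within each bin.
val reducedRDD = modifiedRDD.reduceByKey { case (x, y) => (x._1 + y._1, x._2 + y._2, x._3 + y._3) }
// Divide the sums by the count to get the per-bin means.
val finalRDD = reducedRDD.map { case (bin, (totalC2, totalC3, count)) => (bin, totalC2 / count, totalC3 / count) }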
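
Since the question says the data is already in a DataFrame, the same idea can probably be expressed with the built-in column functions `floor` and `avg` and no RDD conversion. This is only a sketch under some assumptions: the columns are numeric and named c1, c2 and c3, and the helper `binnedMeans`, the output column names `bin`, `mean_c2`, `mean_c3`, and the sample rows are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, col, floor}

object BinnedMeans {
  // Hypothetical helper: assigns each row to bin floor(c1 / binSize),
  // then averages c2 and c3 within each bin.
  def binnedMeans(df: DataFrame, binSize: Int): DataFrame =
    df.withColumn("bin", floor(col("c1") / binSize))
      .groupBy("bin")
      .agg(avg("c2").as("mean_c2"), avg("c3").as("mean_c3"))
      .orderBy("bin")

  def main(args: Array[String]): Unit = {
    // Local session for trying the sketch out; a real job would reuse its own session.
    val spark = SparkSession.builder().master("local[*]").appName("binned-means").getOrCreate()
    import spark.implicits._

    // Illustrative rows standing in for the Cassandra query result.
    val df = Seq((0.5, 1.0, 10.0), (2.9, 3.0, 30.0), (3.0, 5.0, 50.0)).toDF("c1", "c2", "c3")

    binnedMeans(df, 3).show()
    spark.stop()
  }
}
```

Here bin 0 covers c1 values in [0, 3), bin 1 covers [3, 6), and so on. Bins that contain no rows simply do not appear in the result.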
