I want to find the average of all the values associated with a particular key. Below is my program:
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("averages").setMaster("local")
sc = SparkContext(conf=conf)

# read the input file (raw string so the backslashes in the Windows path stay literal)
file_rdd = sc.textFile(r"C:\spark_programs\python programs\input")

# each line looks like "a hyd 2": take the first field as the key, the third as an integer value
vals_rdd = file_rdd.map(lambda x: (x.split(" ")[0], int(x.split(" ")[2])))
print type(vals_rdd)

# my attempt to average the values per key
pairs_rdd = vals_rdd.reduceByKey(lambda x, y: (x + y) / 2)
for line in pairs_rdd.collect():
    print line
The following is the input data:
a hyd 2
b hyd 2
c blr 3
d chn 4
b hyd 5
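Each line should therefore map to a (key, integer value) pair. A quick local check of the split expression (plain Python outside Spark, using the first input line) seems to confirm the parsing itself is fine:

line = "a hyd 2"
# same expression as in the map() above
pair = (line.split(" ")[0], int(line.split(" ")[2]))
print pair  # prints ('a', 2)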
When I run the program, the output I get is below:
(u'a', 2)
(u'c', 3)
(u'b', 3) -- only b's value seems to get averaged.
(u'd', 4)
Apart from b's, none of the values are averaged. Why does this happen? Why aren't the values for a, c, and d averaged?
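To investigate, I tried simulating the per-key fold locally with Python's built-in reduce (a rough sketch with hard-coded groups, not Spark's actual internals), and it reproduces exactly the output above:

# hypothetical local simulation of reduceByKey's per-key fold
grouped = {"a": [2], "b": [2, 5], "c": [3], "d": [4]}
for key in sorted(grouped):
    # with a single value in the list, the lambda never gets called and the
    # value comes back unchanged -- which seems to match what I see in Spark
    print key, reduce(lambda x, y: (x + y) / 2, grouped[key])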