I want to find the average of all the values associated with a particular key. Below is my program:
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("averages").setMaster("local")
sc = SparkContext(conf=conf)

# read the input file (raw string so the backslashes in the Windows path stay literal)
file_rdd = sc.textFile(r"C:\spark_programs\python programs\input")

# each line looks like "a hyd 2": take the first field as the key, the third as an integer value
vals_rdd = file_rdd.map(lambda x: (x.split(" ")[0], int(x.split(" ")[2])))
print type(vals_rdd)

# my attempt to average the values per key
pairs_rdd = vals_rdd.reduceByKey(lambda x, y: (x + y) / 2)
for line in pairs_rdd.collect():
    print line
The following is the input data:
a hyd 2
b hyd 2
c blr 3
d chn 4
b hyd 5
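Each line should therefore map to a (key, integer value) pair. A quick local check of the split expression (plain Python outside Spark, using the first input line) seems to confirm the parsing itself is fine:

line = "a hyd 2"
# same expression as in the map() above
pair = (line.split(" ")[0], int(line.split(" ")[2]))
print pair  # prints ('a', 2)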
When I run the program, the output I get is below:
(u'a', 2)
(u'c', 3)
(u'b', 3) -- only b's value seems to get averaged.
(u'd', 4)
Apart from b's, none of the values are averaged. Why does this happen? Why aren't the values for a, c, and d averaged?
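To investigate, I tried simulating the per-key fold locally with Python's built-in reduce (a rough sketch with hard-coded groups, not Spark's actual internals), and it reproduces exactly the output above:

# hypothetical local simulation of reduceByKey's per-key fold
grouped = {"a": [2], "b": [2, 5], "c": [3], "d": [4]}
for key in sorted(grouped):
    # with a single value in the list, the lambda never gets called and the
    # value comes back unchanged -- which seems to match what I see in Spark
    print key, reduce(lambda x, y: (x + y) / 2, grouped[key])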