
I am trying to understand the AverageByKey and CollectByKey APIs of Spark.

I read this article

http://abshinn.github.io/python/apache-spark/2014/10/11/using-combinebykey-in-apache-spark/

but I don't know if it's just me... I don't understand how these APIs work.

The most confusing part is (x[0] + y[0], x[1] + y[1]).

My understanding was that x is the sum and y is the count; if so, why are we adding the sum and the count together?

– Knows Not Much
  • see this answer: http://stackoverflow.com/questions/28240706/explain-the-aggregate-functionality-in-spark-using-python/28241948#28241948 – maasg Mar 02 '15 at 10:24

1 Answer


Instead of:

sumCount = data.combineByKey(
    lambda value: (value, 1),                   # createCombiner: first value for a key -> (sum, count)
    lambda x, value: (x[0] + value, x[1] + 1),  # mergeValue: fold another value into the (sum, count) accumulator
    lambda x, y: (x[0] + y[0], x[1] + y[1]))    # mergeCombiners: x and y are both (sum, count) pairs, one per partition

In the last lambda, x and y are both (sum, count) accumulators built on different partitions, which is why sums are added to sums and counts to counts. You can write it like this instead, unpacking each accumulator into total and count (note that tuple parameters in lambdas work only in Python 2; Python 3 removed them):

sumCount = data.combineByKey(
    lambda value: (value, 1),
    lambda (total, count), value: (total + value, count + 1),
    lambda (total1, count1), (total2, count2): (total1 + total2, count1 + count2))
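
As a rough sketch for Python 3, where tuple parameters in lambdas no longer exist, named functions can express the same thing, and the per-key average can then be finished with mapValues. It assumes data is a pair RDD of (key, numeric value) as above; the averageByKey name is just illustrative, not something from the article:

def to_sum_count(value):
    # createCombiner: the first value seen for a key becomes a (sum, count) pair
    return (value, 1)

def add_value(acc, value):
    # mergeValue: fold one more value into a partition-local (sum, count) pair
    total, count = acc
    return (total + value, count + 1)

def merge_accumulators(acc1, acc2):
    # mergeCombiners: both arguments are (sum, count) pairs from different
    # partitions, so sums are added to sums and counts to counts
    total1, count1 = acc1
    total2, count2 = acc2
    return (total1 + total2, count1 + count2)

sumCount = data.combineByKey(to_sum_count, add_value, merge_accumulators)

# per-key average from the (sum, count) pairs
averageByKey = sumCount.mapValues(lambda sc: sc[0] / sc[1])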

However, if you need to compute an average, DoubleRDD may help.
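
For what it's worth, my reading of the DoubleRDD suggestion in PySpark terms (this is an assumption, not part of the answer) is the numeric helpers available on RDDs of numbers, such as mean() and stats(); these give one overall figure rather than a per-key result:

# overall (not per-key) statistics, assuming data is the same (key, value) pair RDD
overall_mean = data.values().mean()
overall_stats = data.values().stats()  # StatCounter: count, mean, stdev, max, min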

– G Quintana