
I am trying to understand the AverageByKey and CollectByKey APIs of Spark.

I read this article

http://abshinn.github.io/python/apache-spark/2014/10/11/using-combinebykey-in-apache-spark/

but I don't know if it's just me... I don't understand how these APIs work.

The most confusing part is (x[0] + y[0], x[1] + y[1]).

My understanding was that x is the sum and y is the count; if so, why are we adding the sum and the count together?

– Knows Not Much
  • see this answer: http://stackoverflow.com/questions/28240706/explain-the-aggregate-functionality-in-spark-using-python/28241948#28241948 – maasg Mar 02 '15 at 10:24

1 Answer


Instead of:

sumCount = data.combineByKey(
    lambda value: (value, 1),                   # createCombiner: first value for a key -> (sum, count)
    lambda x, value: (x[0] + value, x[1] + 1),  # mergeValue: fold another value into the (sum, count) accumulator
    lambda x, y: (x[0] + y[0], x[1] + y[1]))    # mergeCombiners: x and y are both (sum, count) pairs, one per partition

In the last lambda, x and y are both (sum, count) accumulators built on different partitions, which is why sums are added to sums and counts to counts. You can write it like this instead, unpacking each accumulator into total and count (note that tuple parameters in lambdas work only in Python 2; Python 3 removed them):

sumCount = data.combineByKey(
    lambda value: (value, 1),
    lambda (total, count), value: (total + value, count + 1),
    lambda (total1, count1), (total2, count2): (total1 + total2, count1 + count2))
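
As a rough sketch for Python 3, where tuple parameters in lambdas no longer exist, named functions can express the same thing, and the per-key average can then be finished with mapValues. It assumes data is a pair RDD of (key, numeric value) as above; the averageByKey name is just illustrative, not something from the article:

def to_sum_count(value):
    # createCombiner: the first value seen for a key becomes a (sum, count) pair
    return (value, 1)

def add_value(acc, value):
    # mergeValue: fold one more value into a partition-local (sum, count) pair
    total, count = acc
    return (total + value, count + 1)

def merge_accumulators(acc1, acc2):
    # mergeCombiners: both arguments are (sum, count) pairs from different
    # partitions, so sums are added to sums and counts to counts
    total1, count1 = acc1
    total2, count2 = acc2
    return (total1 + total2, count1 + count2)

sumCount = data.combineByKey(to_sum_count, add_value, merge_accumulators)

# per-key average from the (sum, count) pairs
averageByKey = sumCount.mapValues(lambda sc: sc[0] / sc[1])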

However, if you need to compute an average, DoubleRDD may help.
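
For what it's worth, my reading of the DoubleRDD suggestion in PySpark terms (this is an assumption, not part of the answer) is the numeric helpers available on RDDs of numbers, such as mean() and stats(); these give one overall figure rather than a per-key result:

# overall (not per-key) statistics, assuming data is the same (key, value) pair RDD
overall_mean = data.values().mean()
overall_stats = data.values().stats()  # StatCounter: count, mean, stdev, max, min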

– G Quintana