I have an RDD with 2 partitions whose elements are key-value pairs:
rdd5.glom().collect()
[[(u'hive', 1), (u'python', 1), (u'spark', 1), (u'hive', 1), (u'spark', 1), (u'python', 1)], [(u'spark', 1), (u'java', 1), (u'java', 1), (u'spark', 1)]]
When I run aggregateByKey to compute the average value per key:
rdd6=rdd5.aggregateByKey((0,0), lambda acc,val: (acc[0]+1,acc[1]+val), lambda acc1,acc2 : (acc1[1]+acc2[1])/acc1[0]+acc2[0])
it does not give me the expected result.
Output:
[(u'python', (2, 2)), (u'spark', 1), (u'java', (2, 2)), (u'hive', (2, 2))]
Expected:
[(u'python', 1), (u'spark', 1), (u'java', 1), (u'hive', 1)]
I can see that the keys present in only one partition are the ones not giving the expected output. What changes should I make to get the expected result for every key?
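
One possible direction, sketched under assumptions rather than confirmed: the second lambda (the combiner) only runs when a key has accumulators in more than one partition, so keys such as python, java and hive never pass through the division and keep their raw (count, sum) tuple. A common way around this is to have the combiner return a (count, sum) pair as well and to do the division in a separate step afterwards. The names seq_op, comb_op, to_average and local_aggregate_by_key below are hypothetical, and local_aggregate_by_key is only a pure-Python stand-in that imitates the per-partition behaviour so the sketch runs without Spark.

```python
# The two partitions shown by rdd5.glom().collect() above.
partitions = [
    [("hive", 1), ("python", 1), ("spark", 1), ("hive", 1), ("spark", 1), ("python", 1)],
    [("spark", 1), ("java", 1), ("java", 1), ("spark", 1)],
]

zero = (0, 0)  # accumulator layout: (count, sum)


def seq_op(acc, val):
    # Fold one value into a partition-local (count, sum) accumulator.
    return (acc[0] + 1, acc[1] + val)


def comb_op(acc1, acc2):
    # Merge two accumulators; the result is still a (count, sum) pair.
    return (acc1[0] + acc2[0], acc1[1] + acc2[1])


def to_average(acc):
    # Final step, applied once per key after all merging is done.
    # float() avoids integer division on Python 2.
    count, total = acc
    return total / float(count)


def local_aggregate_by_key(parts):
    # Hypothetical stand-in for aggregateByKey, for illustration only:
    # seq_op runs within each partition, comb_op only across partitions.
    merged = {}
    for part in parts:
        accs = {}
        for key, val in part:
            accs[key] = seq_op(accs.get(key, zero), val)
        for key, acc in accs.items():
            merged[key] = comb_op(merged[key], acc) if key in merged else acc
    return merged


averages = {k: to_average(v) for k, v in local_aggregate_by_key(partitions).items()}
print(sorted(averages.items()))
```

With Spark itself, the same three functions would presumably be wired up as rdd5.aggregateByKey(zero, seq_op, comb_op).mapValues(to_average), so that every key goes through the division regardless of how many partitions it appears in. Note that the averages come out as floats (1.0) rather than ints.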