My question is based on the first answer to this question. I want to count the elements for each key. How can I do that?
# Keys are written as string literals so the snippet runs without extra definitions
example = sc.parallelize([('alpha', u'D'), ('alpha', u'D'), ('beta', u'E'), ('gamma', u'F')])
abc = example.groupByKey().map(lambda x: (x[0], list(x[1]))).collect()
# Gives [('alpha', [u'D', u'D']), ('beta', [u'E']), ('gamma', [u'F'])]
I want output like this:
alpha: 2, beta: 1, gamma: 1
I found out that the answer is the line below. Why is it so complex? Is there a simpler answer? Does s contain the key plus all the values? Why can't I use len(s) - 1, where I subtract one to remove the key s[0]?
map(lambda s: (s[0], len(list(set(s[1])))))
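
To make the shape of s concrete, here is a minimal plain-Python sketch. It does not use Spark at all: the list `grouped` is a hypothetical stand-in for the pairs that come out of groupByKey() after the values are turned into lists, and the keys are written as strings purely for illustration. It suggests that s is a 2-tuple of (key, values) rather than a flat sequence, and that the set() call counts distinct values only.

```python
# Hypothetical stand-in for the (key, values) pairs shown in the question;
# in Spark the second item would be an iterable produced by groupByKey().
grouped = [("alpha", ["D", "D"]), ("beta", ["E"]), ("gamma", ["F"])]

# Each s is a 2-tuple (key, values), so len(s) is 2 for every key
# and len(s) - 1 would be 1 regardless of how many values there are.
tuple_lengths = [len(s) for s in grouped]

# Counting every element for a key: take the length of the values, s[1].
total_counts = {s[0]: len(s[1]) for s in grouped}

# The expression from the question wraps s[1] in set(), which removes
# duplicates first, so it counts distinct values instead.
distinct_counts = {s[0]: len(set(s[1])) for s in grouped}

print(tuple_lengths)
print(total_counts)
print(distinct_counts)
```

With this data, the set() version reports 1 for alpha because both values are u'D', while the plain length reports 2. If a total count per key is the goal, PySpark's countByKey() on the original pair RDD may be the shorter route, though I have not run it against this exact data.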