My question is based on the first answer to this question. I want to count the elements for each key. How could I do that?

example = sc.parallelize([('alpha', u'D'), ('alpha', u'D'), ('beta', u'E'), ('gamma', u'F')])

abc = example.groupByKey().map(lambda x: (x[0], list(x[1]))).collect()
# Gives [('alpha', [u'D', u'D']), ('beta', [u'E']), ('gamma', [u'F'])]

I want output something like this:

alpha:2, beta:1, gamma:1

I learned that the answer is the snippet below. Why is it so complex? Is there a simpler answer? Does s contain the key plus all the values? Why can't I use len(s) - 1, subtracting one to exclude the key s[0]?

map(lambda s: (s[0], len(list(set(s[1])))))
user2543622

1 Answer

Well, it is not complex. What you really need here is yet another word count:

from operator import add

example.map(lambda x: (x[0], 1)).reduceByKey(add)
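To see what the two steps do without a Spark context, here is a minimal plain-Python sketch of the same word-count idea. It is not PySpark code: the list `pairs` is a stand-in for the RDD contents, and the string keys are an assumption, since the question leaves the keys unquoted.

```python
from operator import add

# Stand-in for the RDD contents (assumption: keys are plain strings).
pairs = [("alpha", "D"), ("alpha", "D"), ("beta", "E"), ("gamma", "F")]

# Step 1: mirrors map(lambda x: (x[0], 1)), replacing each value with 1.
ones = [(key, 1) for key, _ in pairs]

# Step 2: mirrors the idea of reduceByKey(add), combining values per key.
counts = {}
for key, one in ones:
    counts[key] = add(counts[key], one) if key in counts else one

print(sorted(counts.items()))  # [('alpha', 2), ('beta', 1), ('gamma', 1)]
```

In Spark the combining step runs per partition before any shuffle, which this single-process sketch does not model.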

If you plan to collect the result, you can even use countByKey:

example.countByKey()
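countByKey returns a local dictionary of counts to the driver rather than another RDD. A rough plain-Python analogue, assuming the same stand-in data with string keys, is collections.Counter applied to the keys only:

```python
from collections import Counter

# Stand-in for the RDD contents (assumption: keys are plain strings).
pairs = [("alpha", "D"), ("alpha", "D"), ("beta", "E"), ("gamma", "F")]

# Count how often each key occurs; the values are ignored entirely.
key_counts = Counter(key for key, _ in pairs)

print(dict(key_counts))  # {'alpha': 2, 'beta': 1, 'gamma': 1}
```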

You really don't want to use groupByKey here, but assuming there is some hidden reason to apply it after all:

example.groupByKey().mapValues(len)

Why doesn't len(s) - 1 work? Simply because example is a pairwise RDD, or in other words, it contains key-value pairs. The same applies to the result of groupByKey: each element s is a (key, values) pair, so len(s) is always equal to 2.
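A small plain-Python sketch makes the shape of one grouped element concrete. The tuple `s` below is a hypothetical record modeled on the first entry of the collected output in the question, not something read from Spark. It also shows that wrapping the values in set() counts distinct values, which is not the same as counting elements.

```python
# Hypothetical grouped record: (key, values), as in the collected output.
s = ("alpha", ["D", "D"])
key, values = s

pair_length = len(s)           # 2: the key and the collection of values
total = len(values)            # 2: number of elements for this key
distinct = len(set(values))    # 1: set() drops the duplicate 'D'

print(pair_length, total, distinct)  # 2 2 1
```

So for the desired output alpha:2, the length of s[1] is what matters, and len(list(set(s[1]))) would report 1 for alpha in this data.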

zero323
  • appreciate your answer. But would it be possible to explain why `map(lambda s: (s[0], len(list(set(s[1])))))` works? Does s contain key + all values? Why cannot I do len(s)-1 where I am subtracting one to remove the key s[0] – user2543622 Feb 23 '16 at 00:46
  • your update helps. But isn't my `abc` rdd consist of a tuple (key:value1, value2 value3 etc)? Printing abc gives me `Gives [(alpha, [u'D', u'D']), (beta, [u'E']), (gamma, [u'F'])]`. – user2543622 Feb 23 '16 at 01:02
  • No. It is (key, values) – zero323 Feb 23 '16 at 01:07