PySpark groupByKey finding tuple length

Question

My question is based on the first answer to this question. I want to count the elements for each key. How can I do that?

example = sc.parallelize([(u'alpha', u'D'), (u'alpha', u'D'), (u'beta', u'E'), (u'gamma', u'F')])

abc = example.groupByKey().map(lambda x: (x[0], list(x[1]))).collect()
# Gives [(u'alpha', [u'D', u'D']), (u'beta', [u'E']), (u'gamma', [u'F'])]

I want output something like the following:

alpha:2,beta:1, gamma:1

I learned that the answer is the expression below. Why is it so complex? Is there a simpler answer? Does s contain the key plus all the values? Why can't I use len(s) - 1, subtracting one to exclude the key s[0]?

map(lambda s: (s[0], len(list(set(s[1])))))

It's not complex. We do it this way because it follows the `map-reduce` paradigm, which is built on transformations and reductions. – Alberto Bonsanto, Feb 22 '16 at 22:24
It can't be `alpha:2,beta:1, gamma:2`; it should be `alpha:2,beta:1, gamma:1` – Alberto Bonsanto, Feb 22 '16 at 23:28

zero323 · Answer 1 · 2016-02-23T00:54:27.693

1

Well, it is not complex. What you really need here is yet another word count:

from operator import add

example.map(lambda x: (x[0], 1)).reduceByKey(add)

If you plan to collect the result anyway, you can even use countByKey:

example.countByKey()
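
To see what the word-count pattern computes without needing a Spark context, here is a minimal plain-Python sketch that imitates map followed by reduceByKey on a local list. The helper `reduce_by_key` is a hypothetical stand-in written for illustration, not part of PySpark, and the keys are assumed to be plain strings.

```python
from operator import add


def reduce_by_key(pairs, func):
    """Hypothetical local stand-in for reduceByKey: fold values that share a key."""
    result = {}
    for key, value in pairs:
        # The first value seen for a key is stored as-is; later ones are combined with func.
        result[key] = func(result[key], value) if key in result else value
    return result


# Assumed local copy of the data in the question (keys as strings).
example = [("alpha", "D"), ("alpha", "D"), ("beta", "E"), ("gamma", "F")]

# Same shape as example.map(lambda x: (x[0], 1)): every record becomes (key, 1).
ones = [(key, 1) for key, _ in example]

counts = reduce_by_key(ones, add)
print(counts)
```

Each record contributes a 1 to its key, so the per-key sum is the number of records for that key, which is the count the question asks for.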

You really don't want to use groupByKey here, but assuming there is some hidden reason to apply it after all:

example.groupByKey().mapValues(len)

Why doesn't len(s) - 1 work? Simply because example is a pairwise RDD; in other words, it contains key-value pairs. The same applies to the result of groupByKey, where each element is a (key, values) pair. This means len(s) is always equal to 2.
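
The point about the pair structure can be checked in plain Python. The sketch below assumes the grouped records look like the collected `abc` list from the question, with keys as strings and values already turned into lists (in PySpark the values arrive as an iterable, which is why the question wraps them in list). The variable names are illustrative only.

```python
# Assumed shape of the grouped data after collect(): one (key, values) pair per key.
grouped = [("alpha", ["D", "D"]), ("beta", ["E"]), ("gamma", ["F"])]

s = grouped[0]

# s is the pair itself, so its length is 2 no matter how many values the key has.
pair_length = len(s)

# The values live in s[1]; their length is the per-key count.
value_count = len(s[1])

# Wrapping the values in set() removes duplicates, so this counts distinct values instead.
distinct_count = len(set(s[1]))

# Local equivalent of mapValues(len) over the grouped pairs.
counts = {key: len(values) for key, values in grouped}

print(pair_length, value_count, distinct_count, counts)
```

Note that the set-based expression from the question counts distinct values per key, so for alpha, whose two values are both D, it yields 1 rather than 2.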

edited Feb 23 '16 at 00:54

answered Feb 22 '16 at 23:28

zero323


I appreciate your answer. But would it be possible to explain why `map(lambda s: (s[0], len(list(set(s[1])))))` works? Does s contain the key plus all the values? Why can't I use len(s)-1, subtracting one to exclude the key s[0]? – user2543622 Feb 23 '16 at 00:46
Your update helps. But doesn't my `abc` RDD consist of tuples of the form (key: value1, value2, value3, etc.)? Printing abc gives me `[(alpha, [u'D', u'D']), (beta, [u'E']), (gamma, [u'F'])]`. – user2543622 Feb 23 '16 at 01:02
No. It is (key, values) – zero323 Feb 23 '16 at 01:07
