
I have an RDD in the form (Group, [word1, word2, ..., wordn]): each record contains a group and the words which belong to that group. Suppose I have the input below:

rdd = (g1, [w1, w2, w4]), (g2, [w3, w2]), (g3, [w4, w1]), (g3, [w1, w2, w3]), (g2, [w2])
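
For reference, this sample data can be built as an RDD roughly like this (a minimal sketch, assuming an existing SparkContext named `sc`):

rdd = sc.parallelize([
    ("g1", ["w1", "w2", "w4"]),
    ("g2", ["w3", "w2"]),
    ("g3", ["w4", "w1"]),
    ("g3", ["w1", "w2", "w3"]),
    ("g2", ["w2"]),
])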

I want to collect output showing how many times each word occurs in each group. The output format would be:

Word  Group1 Group2  Group3
w1     1       0       2
w2     1       2       1
w3     0       1       1
w4     1       0       1

What PySpark functions can I use to achieve this output in the most efficient way?

vish
  • What you're asking for can probably be done by converting to DF and calling `explode()` and `pivot()`. You're more likely to get responses if you provided an [mcve] so that people can easily recreate your data. See more here: [how to create good reproducible apache spark dataframe examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-dataframe-examples). – pault Feb 21 '18 at 17:09
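
For illustration, here is a minimal sketch of the DataFrame approach suggested in the comment above, assuming a `SparkSession` named `spark` and the sample data from the question:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, col

spark = SparkSession.builder.getOrCreate()

# recreate the sample data as (group, array-of-words) rows
df = spark.createDataFrame(
    [("g1", ["w1", "w2", "w4"]),
     ("g2", ["w3", "w2"]),
     ("g3", ["w4", "w1"]),
     ("g3", ["w1", "w2", "w3"]),
     ("g2", ["w2"])],
    ["group", "words"],
)

# one row per (group, word), then pivot groups into columns and count occurrences
result = (df.withColumn("word", explode(col("words")))
            .groupBy("word")
            .pivot("group")
            .count()
            .na.fill(0)
            .orderBy("word"))
result.show()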

1 Answer


You should use `reduceByKey` on your RDD to combine the arrays that share the same group key:

def combineArrays(x, y):
    # concatenate the word lists of two records that share the same group key
    return x + y

rdd = rdd.reduceByKey(combineArrays)
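
With the sample data, collecting at this point should give something like the following (the element order inside each list may vary):

rdd.collect()
# [('g1', ['w1', 'w2', 'w4']),
#  ('g2', ['w3', 'w2', 'w2']),
#  ('g3', ['w4', 'w1', 'w1', 'w2', 'w3'])]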

Then use `Counter` from the `collections` module to count the occurrences of each element in the combined arrays:

from collections import Counter
rdd = rdd.mapValues(Counter)

You should then have output like:

('g3', Counter({'w1': 2, 'w4': 1, 'w3': 1, 'w2': 1}))
('g1', Counter({'w4': 1, 'w2': 1, 'w1': 1}))
('g2', Counter({'w2': 2, 'w3': 1}))
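
If you also need the pivoted word-by-group table from the question, one possible continuation (a sketch, not necessarily the most efficient) is to flatten the Counters into (word, group, count) rows, convert them to a DataFrame, and pivot; this assumes an active `SparkSession` so that `toDF` is available:

counts = rdd.flatMap(lambda kv: [(word, kv[0], n) for word, n in kv[1].items()])
result = (counts.toDF(["word", "group", "count"])
                .groupBy("word")
                .pivot("group")
                .sum("count")
                .na.fill(0)
                .orderBy("word"))
result.show()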

I hope the answer is helpful.

Ramesh Maharjan