
I am doing a simple wordcount with Spark Streaming. How do I get the n most used words, or in other words, get the first n keys with the highest values?

Here is my code so far:

counts = lines.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKeyAndWindow(lambda a, b: a+b, lambda a, b: a-b, 30, 2)

output = counts.map(lambda pair: (pair[1], pair[0])) \
    .transform(lambda rdd: rdd.sortByKey(ascending=False)) \
    .map(lambda pair: (pair[1], pair[0]))

This already sorts the list in descending order; now I just need to take the top n elements. There are examples out there showing how to do this in Scala, which use rdd.take() and then filter the RDD based on list.contains. But Python doesn't have list.contains.
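For reference, outside Spark the same "top n keys by value" selection can be sketched in plain Python with heapq.nlargest; the word counts here are made up for illustration:

```python
import heapq

# hypothetical (word, count) pairs, standing in for one batch's counts
counts = [("spark", 7), ("streaming", 3), ("wordcount", 5), ("python", 2)]

# the n pairs with the highest counts, without sorting the whole list
top_n = heapq.nlargest(2, counts, key=lambda pair: pair[1])
print(top_n)  # [('spark', 7), ('wordcount', 5)]
```

In PySpark the analogous per-RDD operation would be takeOrdered with a key on the count.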

SilverTear

1 Answer


You can do this with plain Python list operations. To check whether a value is in a list, use the in operator:

if value in mylist:
    # do your action

If you want to take the first n elements from a list, use slicing:

mylist[:10]

This gives you the first 10 elements of the list.
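Applied to the question's already-sorted (word, count) output, a slice yields the top n. A plain-Python sketch with made-up data:

```python
# hypothetical (word, count) pairs, already sorted by count descending,
# like the question's `output` DStream after sortByKey
sorted_counts = [("spark", 7), ("wordcount", 5), ("streaming", 3), ("python", 2)]

n = 2
top_n = sorted_counts[:n]  # slicing keeps the first n elements
print(top_n)  # [('spark', 7), ('wordcount', 5)]
```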

Python list documentation

Please have a look at this answer: pythons-slice-notation

Indrajit Swain