I am doing a simple wordcount with Spark Streaming. How do I get the n most used words, in other words, the n keys with the highest counts?
Here is my code so far:
# Split each line into words and count them over a sliding window:
# window length 30 seconds, slide interval 2 seconds. The inverse
# function (a - b) lets Spark subtract counts that slide out of the window.
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKeyAndWindow(lambda a, b: a + b, lambda a, b: a - b, 30, 2)
# Swap (word, count) to (count, word), sort descending by count, then swap back.
# Note: tuple-unpacking lambdas like `lambda (a, b): ...` are Python 2 only,
# so I use indexing instead.
output = counts.map(lambda pair: (pair[1], pair[0])) \
               .transform(lambda rdd: rdd.sortByKey(ascending=False)) \
               .map(lambda pair: (pair[1], pair[0]))
This already sorts the pairs in descending order; now I just need to take the top n elements. There are examples out there showing how to do it in Scala, which use rdd.take() and then filter the RDD based on list.contains. But Python lists don't have a contains method (the closest equivalent is the in operator).
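For reference, here is a rough Python translation of that Scala pattern as I understand it — a minimal sketch, assuming the output DStream above; N and the helper name top_n are mine, and it uses the in operator where the Scala version uses list.contains:

N = 10  # hypothetical: how many top words to keep

def top_n(rdd):
    # The RDD contains (word, count) pairs already sorted by count
    # descending, so take() returns the N most frequent words.
    top = rdd.take(N)
    # Keep only the pairs that made it into the top-N list;
    # `pair in top` is Python's equivalent of Scala's list.contains.
    return rdd.filter(lambda pair: pair in top)

top_counts = output.transform(top_n)
top_counts.pprint()

Is this the idiomatic way to do it, or is there a better approach?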