I am doing a simple wordcount with Spark Streaming. How do I get the n most used words, in other words, the n keys with the highest counts?
Here is my code so far:
# Split each line into words and count them over a sliding window:
# window length 30 seconds, slide interval 2 seconds. The inverse
# function (a - b) lets Spark subtract counts that slide out of the window.
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKeyAndWindow(lambda a, b: a + b, lambda a, b: a - b, 30, 2)
# Swap (word, count) to (count, word), sort descending by count, then swap back.
# Note: tuple-unpacking lambdas like `lambda (a, b): ...` are Python 2 only,
# so I use indexing instead.
output = counts.map(lambda pair: (pair[1], pair[0])) \
               .transform(lambda rdd: rdd.sortByKey(ascending=False)) \
               .map(lambda pair: (pair[1], pair[0]))
This already sorts the pairs in descending order; now I just need to take the top n elements. There are examples out there showing how to do it in Scala, which use rdd.take() and then filter the RDD based on list.contains. But Python lists don't have a contains method (the closest equivalent is the in operator).
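For reference, here is a rough Python translation of that Scala pattern as I understand it — a minimal sketch, assuming the output DStream above; N and the helper name top_n are mine, and it uses the in operator where the Scala version uses list.contains:

N = 10  # hypothetical: how many top words to keep

def top_n(rdd):
    # The RDD contains (word, count) pairs already sorted by count
    # descending, so take() returns the N most frequent words.
    top = rdd.take(N)
    # Keep only the pairs that made it into the top-N list;
    # `pair in top` is Python's equivalent of Scala's list.contains.
    return rdd.filter(lambda pair: pair in top)

top_counts = output.transform(top_n)
top_counts.pprint()

Is this the idiomatic way to do it, or is there a better approach?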