Is there a O(N) solution to getting top k most occurring string in List?

Question

The problem is: Given a list of strings and an integer k, return the top k most frequently occurring words in descending order based on frequency. This must be done O(N) where N is the length of the list of strings.

The popular solution is to store (word, frequency) in a hash table, sort the hash table by frequency, and output the top k words. This is not O(N) however because sorting by frequency takes O(NlgN).

I'm wondering if a O(N) solution actually exists. I've thought about quick-select where you get the kth top most occurring word and sort the remaining frequencies but that would be O(N + klgk) which is still O(NlgN) when k is say N.

Linear time suffix trees do exist, so I think you might be able to use them. But they're hard to implement. Suffix arrays might work too. — user541686, Aug 09 '14 at 17:01
No need to sort all the n words; you can use a heap to store only the top k words; this should then be O(n logk) (n for all the words and logk for inserting them into a size-k heap. — tobias_k, Aug 09 '14 at 17:04

score 1 · Answer 1 · answered Aug 09 '14 at 17:03

1

Yes, it does exist. There is no need to actually sort the pairs. One can find k-th element in O(n)(it's a well known algorithm). Then all elements that are greater than or equal to the k-th element are the top elements.

answered Aug 09 '14 at 17:03

kraskevich

18,368
4
33
45

score 1 · Answer 2 · answered Aug 09 '14 at 19:08

Yes, there is a O(n) solution.

You can change the famous SELECT algorithm which running in complexity of O(n) to your needs (the algorithm purpose is to find the K'th smallest elements out of N elements).

and later you just pick the the elements which are "greater or equal" to the element the algorithm returns.

more details you can find at : SELECT algorithm

Is there a O(N) solution to getting top k most occurring string in List?

2 Answers2