2

The problem is: Given a list of strings and an integer k, return the top k most frequently occurring words in descending order based on frequency. This must be done O(N) where N is the length of the list of strings.

The popular solution is to store (word, frequency) in a hash table, sort the hash table by frequency, and output the top k words. This is not O(N) however because sorting by frequency takes O(NlgN).

I'm wondering if a O(N) solution actually exists. I've thought about quick-select where you get the kth top most occurring word and sort the remaining frequencies but that would be O(N + klgk) which is still O(NlgN) when k is say N.

Popcorn
  • 5,188
  • 12
  • 54
  • 87
  • Linear time suffix trees do exist, so I think you might be able to use them. But they're hard to implement. Suffix arrays might work too. – user541686 Aug 09 '14 at 17:01
  • No need to sort all the n words; you can use a heap to store only the top k words; this should then be O(n logk) (n for all the words and logk for inserting them into a size-k heap. – tobias_k Aug 09 '14 at 17:04

2 Answers2

1

Yes, it does exist. There is no need to actually sort the pairs. One can find k-th element in O(n)(it's a well known algorithm). Then all elements that are greater than or equal to the k-th element are the top elements.

kraskevich
  • 18,368
  • 4
  • 33
  • 45
1

Yes, there is a O(n) solution.

You can change the famous SELECT algorithm which running in complexity of O(n) to your needs (the algorithm purpose is to find the K'th smallest elements out of N elements).

and later you just pick the the elements which are "greater or equal" to the element the algorithm returns.

more details you can find at : SELECT algorithm

Jumpa
  • 878
  • 1
  • 8
  • 12