2

I need to print the K most-used words from a text file containing N words in O(N) time complexity.

I have tried it using HashMap by taking the word as key and number of occurrence as value and then sorting it by values. But sorting the HashMap by values takes O(NlogN) which is more than my requirement.

If K = 10 then I need to print the 10 most used words from a text file.

Michael
  • 41,989
  • 11
  • 82
  • 128
Surya Vamsi
  • 21
  • 1
  • 5
  • 2
    You don't need to sort something to get the maximum value. – Andy Turner Apr 12 '21 at 11:26
  • 1
    **If you want the top 1 word with N occurrences...** Add a variable outside the loop `int maxFreq = 0`. When you insert into the map, check the return value. If the new frequency is larger than `maxFreq`, assign it to `maxFreq`. (if you care about the word as well freq, add a 2nd variable `String maxFreqWord = null` and assign that at the same time). **If you want the N most-used words** (e.g. top 5 most frequent words), you can do something similar with a 2nd map, but it's a bit more involved because you need to pay attention to evicting the lowest frequency item – Michael Apr 12 '21 at 11:26
  • 4
    Hang on, is `n` the number of repeated words you need to find, or the number of words in the text file? – Andy Turner Apr 12 '21 at 11:27
  • I want the n most-used words as stated by @Michael . I have fixed the title to avoid confusion. – Surya Vamsi Apr 12 '21 at 11:38
  • 1
    See this answer for a solution https://stackoverflow.com/a/22341665/44522 to the same question. – MicSim Apr 12 '21 at 11:41
  • 1
    This is still not clear. You can't find `n` words in `O(n)` from an arbitrarily long text file. – Henry Apr 12 '21 at 11:55

1 Answers1

0

You can't do sorting, because sorting is as O(NlogN), which apparently can not satisfy you.

Top k problem has the standard stereotype of solutions, which is O(N).

first, use a map to store the frequency. Iterate the list, calculate the frequency of words, then put them into the map.

Second, use a PriorityQueue to calculate the top k frequency, which has a volume of k. Iterate the map, put each entry into the PriorityQueue, if the size of the PriorityQueue is larger than k, then do the poll().

finally, the ProioritiyQueue contains the result of the top K frequency words.