Runtime of this implementation to find k most frequently recurring words in a string

Question

Here's my algorithm:

(a) Make a WordOccurrence class that contains a String sentence and an int numberOfOccurrences. Let's assume the sentence has n words. Make this class implement Comparable, and make the comparator the numberOfOccurrences first, and then (arbitrarily,) the alphabetical (natural) ordering of the word second.

(b) Iterate through sentence with .split(‘ ‘) (or do it in place with iteration to save space). Create a new WordOccurrence object for each unique word, updating its occurrence and placing all WordOccurrence objects into a TreeMap<WordOccurrence>.

(c) Create a new WordOccurrence object for each unique word, placing all WordOccurrence objects into a TreeMap<WordOccurrence> (and updating words' occurrences along the way).

(d) Call highestKey() on the TreeMap, and place the returned word into a resultant list. Then call lowerKey with the previously returned word (k - 1) times, placing the words into the same resultant list.

(e) Return the resultant list.

My question: What is the runtime of this? This is my understanding:

Step (b) takes O(n) time.

Step (c) takes O(n*log(n)), as for each of the n words, insertion is O(log n).

Step (d) takes O(k*log(n)), as each call of highestKey() or lowerKey() takes O(log n) time.

So the total runtime would be: O(n + n*log(n) + k*log(n), which is O(n*log n).

Is there a tighter bound for this algorithm, or a way to get it to O(n)?

Thanks.

You could use a HashMap instead of a Treemap for O(1) updates and a bucket/radix sort to make (d) O(N) for reasonable counts. — Peter Lawrey, Sep 09 '14 at 20:48
After step (b), you should be able to compute the `k` "greatest" words in O(k*n) time, by keeping an array of size `k`, going through the map, and replacing the "least" word in the array with a word from the map if the word from the map is "greater" than it. Your method seems to put *all* the words in sorted order, which is unnecessary if you want only the greatest `k` words. If `k` is a lot less than `n` than this should be better then O(n log n). — ajb, Sep 09 '14 at 20:53
@PeterLawrey: Extra credit: Which buckets should you use so that all the counts are "reasonable"? — tmyklebu, Sep 09 '14 at 20:57
@tmyklebu the upper bound is string.length()/2. If you used a List of Lists, it should be still practical. — Peter Lawrey, Sep 09 '14 at 21:12

score 0 · Answer 1 · answered Sep 09 '14 at 20:54

Yours is a poor approach for several reasons:

Swaddling everything in an object is going to make the constant factor here unreasonably high. You're creating at least two objects per word in the string; given that the average English word is six letters, you can figure out the average blowup.
You can use a TreeMap<String, Integer> instead of using this WordOccurrence thingamajig. This is still horrible for the first reason, but it reduces the amount of code you have to write.
Your reasoning does not take into consideration the fact that the keys are strings and hence variable-length. I do not think this affects anything, but I would mark it down if I were a teaching assistant.
For the data structure, I would use a hash table instead. You can use a HashMap in Java instead and you get something slightly better. You might look at Trove or something for a HashMap whose values are ints instead of Integers.

Also, there is a data structure called a suffix tree. You can build a suffix tree on your string in linear time and explore it to figure out all the distinct words and the count of each. Then you can do linear-time selection to find the top k. This certainly loses to the hash table in practise, but in theory it avoids constant-time-lookup hash tables.

score 0 · Answer 2 · answered Sep 09 '14 at 21:03

I believe your analysis is tight for this algorithm. As for a tighter bound:
--Use a HashMap instead of a TreeMap. Now step (c) runs in O(N).
--After step (c), create a TreeMap.
--Iterate over the entries in your HashMap. For each entry, add it to the TreeMap. If the size of your TreeMap is greater than k, call remove(lastEntry()).

At the end you have the k entries you wanted. Since your TreeMap is never larger than k+1, the running time complexity is O(n) + O(k*log(n)) < O(n*log(n)).

Runtime of this implementation to find k most frequently recurring words in a string

2 Answers2