Possible Duplicate:
The Most Efficient Way To Find Top K Frequent Words In A Big Word Sequence
Algorithm to find the top 3 occurring words in a book of 1000 pages. Is there a better solution than using a hashtable?
A potentially better solution is to use a trie-based dictionary. With a trie you can perform the task in worst-case O(n × N) time, where N is the number of words and n is their average length. Unlike a hash table, a trie's worst-case bound does not depend on the hash function or on the book's vocabulary.
There's no way to do better than O(n × N) for arbitrary input since you'll have to scan through all the words.
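For concreteness, here is a minimal sketch of trie-based counting in Python (all names are illustrative, not from any particular library):

```python
import heapq

class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.count = 0       # occurrences of the word ending at this node

def count_words(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:                # walk/extend the trie, O(len(word))
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1
    return root

def top_k(root, k=3):
    # Walk the trie and keep the k most frequent words in a min-heap.
    heap = []                          # (count, word) pairs, smallest on top
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        if node.count:
            heapq.heappush(heap, (node.count, prefix))
            if len(heap) > k:
                heapq.heappop(heap)
        for ch, child in node.children.items():
            stack.append((child, prefix + ch))
    return sorted(heap, reverse=True)
```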
It is strange that everyone has concentrated on scanning the word list and forgotten the main issue: taking the k most frequent items. A hash map is good enough to count occurrences, but that implementation still needs sorting, which is O(n·log n) even in the best case.
So a hash-map implementation needs one pass to count the words (expected, but not guaranteed, O(n)) plus O(n·log n) to sort the counts. The tries mentioned here may be a better solution for counting, but sorting is still the issue: again, one pass plus a sort.
What you actually need is a heap, i.e. a tree-based data structure that keeps the largest (or smallest) elements close to the root. Simple implementations of a heap (e.g. a binary heap) need O(log n) time to insert a new element and O(1) to get the highest element(s), so the resulting algorithm takes O(n·log n) in a single pass. More sophisticated implementations (e.g. a Fibonacci heap) take amortized O(1) time per insertion, so the resulting algorithm takes O(n) time, which is better than any solution suggested so far.
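A minimal sketch of this count-then-select idea, assuming plain Python: collections.Counter for the counting pass and heapq.nlargest, which maintains a heap of size k during the selection step, so selection costs O(n·log k) instead of a full sort:

```python
import heapq
from collections import Counter

def top_k_frequent(words, k=3):
    counts = Counter(words)          # one pass over the words to count them
    # nlargest keeps a k-element heap while scanning the counts: O(n * log k)
    return heapq.nlargest(k, counts.items(), key=lambda item: item[1])

# Usage (file name is hypothetical):
# print(top_k_frequent(open("book.txt").read().lower().split()))
```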
You're going to have to go through all of the pages word by word to get an exact answer.
So a linked-list implementation that also uses a hash table to store pointers to the nodes of the linked list would do very well.
You need the linked list so that it can grow dynamically, and the hash table for quick access to exactly the node you need, so you can update its count.
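A minimal sketch of that combination, assuming Python and using a plain list in place of the linked list for brevity (all names are illustrative):

```python
class Node:
    """A counter cell that the hash table points at."""
    def __init__(self, word):
        self.word = word
        self.count = 0

def count_with_index(words, k=3):
    nodes = []    # the dynamically growing list of counter nodes
    index = {}    # hash table: word -> its Node, for O(1) expected access
    for word in words:
        node = index.get(word)
        if node is None:              # first time we see this word
            node = Node(word)
            index[word] = node
            nodes.append(node)
        node.count += 1               # update the count through the pointer
    # A full sort is fine for illustration; a size-k heap would avoid it.
    return [(n.word, n.count)
            for n in sorted(nodes, key=lambda n: n.count, reverse=True)[:k]]
```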
A simple approach is to use a Dictionary<string, int> (.NET) or a Hashtable and count the occurrence of each word while scanning the whole book.
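The same idea sketched in Python, with a plain dict standing in for Dictionary<string, int>:

```python
def count_occurrences(words):
    counts = {}      # plays the role of Dictionary<string, int>
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts
```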
Wikipedia says this:
"For certain string processing applications, such as spell-checking, hash tables may be less efficient than tries, finite automata, or Judy arrays. Also, if each key is represented by a small enough number of bits, then, instead of a hash table, one may use the key directly as the index into an array of values. Note that there are no collisions in this case."
I would also have guessed a hash tree.