4

Possible Duplicate:
The Most Efficient Way To Find Top K Frequent Words In A Big Word Sequence

Algorithm to find the top 3 most frequently occurring words in a book of 1000 pages. Is there a better solution than using a hashtable?

Aravindh
  • Yes, you can solve this using a parallel algorithm, which still uses hashtables. – Keith Nicholas Apr 17 '12 at 23:12
  • @KeithNicholas: You can use a parallel algorithm, but chances are that it won't do much good -- at least with reasonably typical hardware, just about any halfway-decent algorithm is virtually certain to be heavily I/O bound. – Jerry Coffin Apr 17 '12 at 23:14
  • @KeithNicholas The standard parallel algorithm to use here is MapReduce, which does *not* use hash tables. – btilly Apr 18 '12 at 05:56

6 Answers

3

A potentially better solution is to use a trie-based dictionary. With a trie, you can perform the task in worst-case O(n × N) time, where N is the number of words and n is their average length. The difference from a hash table is that a trie's complexity is independent of any hash function and of the book's vocabulary.

There's no way to do better than O(n × N) for arbitrary input since you'll have to scan through all the words.
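
For illustration only, here is a minimal sketch of that idea in Python (the node layout, function names, and tokenization are my assumptions, not part of the answer): count every word in a character trie, then make a single pass over the trie while keeping the three largest counts seen so far.

```python
from collections import defaultdict

def make_node():
    # A trie node: child nodes keyed by character, plus a count of
    # words that end at this node.
    return {"children": defaultdict(make_node), "count": 0}

def count_words(words):
    root = make_node()
    for word in words:            # N words
        node = root
        for ch in word:           # n characters on average
            node = node["children"][ch]
        node["count"] += 1
    return root

def top3(root):
    best = []                     # at most 3 (count, word) pairs
    stack = [(root, "")]
    while stack:                  # one traversal of the trie
        node, prefix = stack.pop()
        if node["count"]:
            best.append((node["count"], prefix))
            best = sorted(best, reverse=True)[:3]
        for ch, child in node["children"].items():
            stack.append((child, prefix + ch))
    return best
```

For example, `top3(count_words(open("book.txt").read().lower().split()))` would report the three most frequent words, assuming the whole book fits in memory.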

Fred Foo
  • I think this would be a good solution. Thanks. – Aravindh Apr 17 '12 at 23:26
  • With a trie you will get the word frequencies in O(n) time, but not the 3 most frequent of them. To get the most frequent words you still need to sort the resulting counts, which takes at least O(n*logn). See my answer for more details. – ffriend Apr 18 '12 at 01:58
  • @ffriend: you can just loop through the trie in O(N×n) time and record the three highest values you find, just like with a hash table. – Fred Foo Apr 18 '12 at 09:43
  • No, you can't. It is possible to find 1 max item in a list in linear time, but not 3. For 3 you will need 3 passes of linear search. And in the general case with k most frequent items, you will get O(n*N*k) time, which is much more than the task actually needs. – ffriend Apr 18 '12 at 10:48
  • @ffriend: actually, you can do this in a single pass, but even three passes would still take linear time (though multiple passes would make tie breaking a bit harder). The question, btw., is about constant *k*. – Fred Foo Apr 18 '12 at 10:49
3

It is strange that everybody has concentrated on going through the word list and forgotten about the main issue: taking the k most frequent items. Actually, a hash map is good enough to count occurrences, but this implementation still needs sorting, which is de facto O(n*logn) (in the best case).

So, the hash map implementation needs 1 pass to count the words (O(n), but not guaranteed) and O(n*logn) to sort them. The tries mentioned here may be a better solution for counting, but sorting is still the issue. And again, 1 pass + sorting.

What you actually need is a heap, i.e. a tree-based data structure that keeps the largest (or smallest) elements close to the root. Simple implementations of a heap (e.g. a binary heap) need O(logn) time to insert new elements and O(1) to get the highest element(s), so the resulting algorithm takes O(n*logn) and only 1 pass. More sophisticated implementations (e.g. a Fibonacci heap) take amortized O(1) time for insertion, so the resulting algorithm takes O(n) time, which is better than any suggested solution.
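
A rough sketch of that count-then-heap idea in Python (`heapq` is a plain binary heap, so this gives the O(n*logn) variant rather than the Fibonacci-heap bound; the function name and tokenization are assumptions):

```python
import heapq
from collections import Counter

def top_k_frequent(words, k=3):
    counts = Counter(words)                      # 1 pass to count occurrences
    heap = [(-count, word) for word, count in counts.items()]
    heapq.heapify(heap)                          # O(m) for m distinct words
    top = []
    for _ in range(min(k, len(heap))):           # k pops, O(k*logm)
        neg_count, word = heapq.heappop(heap)
        top.append((word, -neg_count))
    return top
```

No full sort of the counts is performed: only the k largest entries are ever extracted from the heap.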

ffriend
  • I was thinking of using a max heap initially, this makes sense. – Aravindh Apr 18 '12 at 03:52
  • There's absolutely no need to sort. Finding the three highest values is a trivial extension of the single-pass max-finding loop. – Fred Foo Apr 18 '12 at 09:44
  • @larsmans: I'm pretty sure the 3 most frequent words are only an example; in general the question is: how to find the k most frequent items in a list. In the case of exactly 3 items you may iterate 3 times through the counts and thus find the 3 max values in O(3*m) time (where m is the number of unique words). However, for k max items you will need k passes and thus O(k*m) complexity, which is not linear anymore. Any attempt to find the k max numbers in a single pass (did you ever try to do it yourself?) will lead to some kind of priority queue, and the best options here are Fibonacci and Brodal heaps. – ffriend Apr 18 '12 at 10:42
  • For any fixed k we have O(k*m) = O(m), so this won't improve the asymptotic complexity for a fixed k. And the question quite clearly states k is fixed. – Sven Marnach Apr 18 '12 at 10:57
  • @SvenMarnach: the question also states that there are 1000 pages in the book, but does that mean we should design the algorithm only for books with 1000 pages? ;) I agree that with a fixed k any solution (hash table, tries, heaps) formally gives O(n) time, and with small k their resulting time T will be more or less the same. But if we know that k is always small, there's no reason to go beyond hash maps (which seem to be the easiest solution), and if k may become as large as n, hash maps and tries will give up to O(n^2), so it makes sense to think of a more appropriate algorithm. – ffriend Apr 18 '12 at 11:12
  • This can be done in O(n log k), because while updating the trie with the count of each word, we can also update a min-heap of size k; the min-heap will contain the k most frequent words. – Aman Jain Sep 15 '13 at 20:10
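
A sketch of the bounded-heap variant from the last comment, using a plain hash-based counter in place of the trie (Python; the names are illustrative assumptions): only k entries ever live in the heap, so the heap work is O(m log k) over m distinct words.

```python
import heapq
from collections import Counter

def top_k_bounded(words, k=3):
    counts = Counter(words)                      # count with a hash map
    heap = []                                    # min-heap of (count, word), size <= k
    for word, count in counts.items():
        if len(heap) < k:
            heapq.heappush(heap, (count, word))
        elif count > heap[0][0]:                 # beats the smallest of the current top k
            heapq.heapreplace(heap, (count, word))
    return sorted(heap, reverse=True)            # most frequent first
```
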
1

You're going to have to go through all of the pages word by word to get an exact answer.

So a linked list implementation that also uses a hashtable to store pointers to the nodes of the linked list would do very well.

You need the linked list to grow dynamically and the hashtable to give quick access to the right node so you can update the count.
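
One way to read that design, sketched in Python (the class and names are made up for illustration): the hashtable maps each word to its node so the count update is O(1) per word, and a final pass over the nodes picks out the three largest counts.

```python
import heapq

class Node:
    # A (word, count) record; in a full implementation these records
    # would be linked together so the list can grow as new words appear.
    def __init__(self, word):
        self.word = word
        self.count = 0

def top3_counts(words):
    index = {}                         # hashtable: word -> its node
    nodes = []                         # stands in for the linked list
    for word in words:
        node = index.get(word)
        if node is None:
            node = Node(word)
            index[word] = node
            nodes.append(node)
        node.count += 1
    # single scan over the distinct words for the three largest counts
    return [(n.word, n.count) for n in heapq.nlargest(3, nodes, key=lambda n: n.count)]
```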

-1

A simple approach is to use a Dictionary<string, int> (.NET) or a HashTable and count the occurrence of each word while scanning the whole book.
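
In Python terms that is roughly the following (a sketch; the file name and the naive whitespace tokenization are assumptions):

```python
from collections import Counter

def top3_words(path="book.txt"):
    with open(path, encoding="utf-8") as f:
        words = f.read().lower().split()     # naive tokenization
    return Counter(words).most_common(3)     # hash-based counting, then top 3
```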

Sandeep
  • Though I wasn't the one who downvoted, I can guess that it's from giving a relatively language-specific answer to a question that doesn't specify anything about implementation language. For example, in C++ it would be called `map` or `unordered_map`, not `Dictionary`. In AWK it's just an associative array. In SNOBOL it's a `table`. – Jerry Coffin Apr 17 '12 at 23:11
  • Also, the other thing is that he is asking for something BETTER than a hashtable. – Keith Nicholas Apr 17 '12 at 23:13
  • ... and without reference to a language/library, `Dictionary` might mean anything from a hashtable to an association list (linear-time lookup). – Fred Foo Apr 17 '12 at 23:19
-1

Wikipedia says this:

"For certain string processing applications, such as spell-checking, hash tables may be less efficient than tries, finite automata, or Judy arrays. Also, if each key is represented by a small enough number of bits, then, instead of a hash table, one may use the key directly as the index into an array of values. Note that there are no collisions in this case."

I would also have guessed a hash tree.

Philipp
-1

This algorithm solves the problem with a complexity of

n + lg(n) - 2, where n = 3 here.

http://www.seeingwithc.org/topic3html.html

Rockstart