
You are given billions of words. You have to find the K most frequent words.

For the above problem I have some solutions, but I think a better approach must exist. First, please have a look at my approaches.

Solutions:

Approach 1

  1. Split the file into chunks. For every word in a chunk, emit a pair where key = word and value = 1.
  2. Sort the pairs in each chunk by key.
  3. Merge the pairs that share a common key into one entry; the value becomes the frequency of that key.
  4. At this point every chunk holds its words with their frequencies.
  5. Concatenate the pairs of all chunks, sort again, and merge the frequencies in the same way.
  6. Output the top K most frequent words (a sketch follows this list).
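
A minimal in-memory Python sketch of Approach 1 (hedged: real input would be streamed from disk chunk by chunk; the sample chunks and function names here are made up for illustration):

```python
from itertools import groupby

def chunk_counts(words):
    # Steps 1-3: emit (word, 1) pairs, sort them by key, then collapse
    # runs of equal keys into (word, frequency) pairs.
    pairs = sorted((w, 1) for w in words)
    return [(key, sum(v for _, v in run))
            for key, run in groupby(pairs, key=lambda p: p[0])]

def top_k(chunks, k):
    # Steps 4-6: concatenate the per-chunk counts, sort by key, merge the
    # frequencies the same way, then take the K most frequent words.
    merged = sorted(pair for chunk in chunks for pair in chunk_counts(chunk))
    totals = [(key, sum(v for _, v in run))
              for key, run in groupby(merged, key=lambda p: p[0])]
    return sorted(totals, key=lambda p: p[1], reverse=True)[:k]

chunks = [["the", "cat", "the"], ["cat", "the", "dog"]]
print(top_k(chunks, 2))  # [('the', 3), ('cat', 2)]
```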

Approach 2 (this approach is not complete)

  1. Make a doubly linked list in which the front holds the most frequently occurring word and the rear the least frequently occurring one.
  2. Whenever a new word arrives, update the doubly linked list.

But Approach 2, as stated, is not correct; a sketch of one common fix follows.
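
For reference, a hedged sketch of one way Approach 2 could be repaired, assuming all distinct words fit in memory: keep a hash map from each word to its bucket in a doubly linked list of frequency buckets (the structure used by LFU caches), so the list stays ordered by count with O(1) work per incoming word. The class and method names below are mine, not from any particular library:

```python
class FrequencyList:
    # Doubly linked list of buckets; each bucket holds every word with the
    # same count, and buckets are linked in increasing order of count.

    class Bucket:
        def __init__(self, count):
            self.count = count
            self.words = set()
            self.prev = self.next = None

    def __init__(self):
        self.head = self.Bucket(0)  # sentinel at the rear (count 0)
        self.word_bucket = {}       # word -> bucket currently holding it

    def add(self, word):
        bucket = self.word_bucket.get(word, self.head)
        nxt = bucket.next
        if nxt is None or nxt.count != bucket.count + 1:
            # Splice in a new bucket for count + 1.
            new = self.Bucket(bucket.count + 1)
            new.prev, new.next = bucket, nxt
            bucket.next = new
            if nxt:
                nxt.prev = new
            nxt = new
        nxt.words.add(word)
        if word in self.word_bucket:
            bucket.words.discard(word)
            if not bucket.words and bucket is not self.head:
                # Unlink the now-empty bucket.
                bucket.prev.next = bucket.next
                bucket.next.prev = bucket.prev
        self.word_bucket[word] = nxt

    def top_k(self, k):
        # Walk to the most-frequent end, then collect words moving rearward.
        bucket = self.head
        while bucket.next:
            bucket = bucket.next
        out = []
        while bucket is not self.head and len(out) < k:
            for w in bucket.words:
                out.append((w, bucket.count))
                if len(out) == k:
                    break
            bucket = bucket.prev
        return out

fl = FrequencyList()
for w in ["a", "b", "a", "c", "a", "b"]:
    fl.add(w)
print(fl.top_k(2))  # [('a', 3), ('b', 2)]
```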

Please suggest an algorithm that is better than Approach 1, and please check Approach 2 and improve it.

devsda
  • For storing the word and its count you can use a max-heap, which will make it easier to find the top K words – AD.Net Jun 15 '14 at 18:30
  • I doubt you'll get approach 2 to work faster than O(n^2) (with n being the number of (unique) words) - a linked-list isn't particularly well-suited for this problem. – Bernhard Barker Jun 15 '14 at 18:34
  • @AD.Net But the problem is: if a word is infrequent at the moment and we drop it, that word could still become frequent later. Please tell me how to handle that situation. – devsda Jun 15 '14 at 18:36
  • @Dukeling Suppose it is not possible to keep all the words in RAM. In that situation you have to drop words intelligently. I am confused here. – devsda Jun 15 '14 at 18:37
  • I think you can prove an Omega(n) space lower bound if you want linear time (independent of k). Of course that doesn't mean you need to store everything in RAM, you can store it on disk as well. – Niklas B. Jun 15 '14 at 21:24
  • @NiklasB. It can be done in O(k) space with linear time by holding an array of 2k elements, and repeatedly: find the k'th largest element [`O(k)`], discard all smaller elements from the array. This is done in `O(k)` time, and it repeats `O(n/k)` times - giving you `O(n)` time with only `O(k)` extra space. – amit Jun 16 '14 at 09:31
  • @NiklasB. Apologies, I meant only the "find top-k" part. Finding the frequencies is easily reducible to element-distinctness, so indeed you need Omega(n) space for O(n) time – amit Jun 16 '14 at 09:34
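
To make amit's O(k)-space trick concrete, here is a hedged Python sketch of the top-K selection step over already-computed (word, count) pairs. It buffers at most 2k pairs and prunes to the k largest whenever the buffer fills; `heapq.nlargest` makes each prune O(k log k) rather than the O(k) a true linear-time selection would give, but the overall structure is the same:

```python
import heapq

def top_k_counts(word_count_pairs, k):
    # Select the K pairs with the largest counts using O(k) extra space:
    # buffer up to 2k pairs, and whenever the buffer fills, keep only the
    # k largest and discard the rest.
    buf = []
    for pair in word_count_pairs:  # pair = (word, count)
        buf.append(pair)
        if len(buf) >= 2 * k:
            buf = heapq.nlargest(k, buf, key=lambda p: p[1])
    return heapq.nlargest(k, buf, key=lambda p: p[1])

pairs = [("a", 3), ("b", 2), ("c", 1), ("d", 5)]
print(top_k_counts(pairs, 2))  # [('d', 5), ('a', 3)]
```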

0 Answers