There are billions of words given to you. You have to find the top K frequent words.
For above problem, I have some solutions, but I think there is definitely better approach present. First please have a look on my approach.
Solutions :-
Approach 1
- Make a chunks of above file. For every chunk, make pair, where key = word, and value = 1.
- Then sort pairs in each chunks on the basis of key.
- Then, make one entry for commmon key. Now value become frequency of key.
- Till this point every chunk have words with their frequency.
- Now concatenate the whole pairs of each chunk. Then sort. Then update the frequency.
- Throw top K frequency words.
Approach 2.(This approach is not complete)
- Make a double link list, in which front is mostly occurring word, and rear one least occurring word.
- Whenever new word comes, update the doubly linked list.
But approach 2 is not correct.
Please suggest if you have any algorithm that is better than Approach 1. And please check approach 2, improve this.