
I have the following problem: I am given a text file containing a huge number of words, with repetition allowed. I need to write an algorithm that outputs the 1000 words that appear most often, together with their frequencies, in decreasing order. Here is an example.

**input.txt**

aa aa bb cc bb bb bb dd dd

**output.txt** (note - frequencies can be repeated)

bb 4
aa 2
dd 2
cc 1
And here is my way of solving this problem (a rough sketch in code follows the list):

  1. First, read all the words and store them in a HashMap with the word as the key and its count as the value. The end result is that I have every word with its frequency.

  2. Now I iterate over the HashMap, create an object {word, frequency} for each key-value pair, and insert it into a SortedSet (I wrote a comparator for that).

  3. Now I just iterate over the first 1000 entries of the SortedSet to get the result.
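
For reference, here is a minimal sketch of those three steps in Java 8. It assumes whitespace-separated words in a file called input.txt, breaks frequency ties alphabetically, and uses illustrative names (TopWords etc.) that are not part of the question:

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;
import java.util.TreeSet;

public class TopWords {

    public static void main(String[] args) throws FileNotFoundException {
        // Step 1: count every word in a HashMap (word -> frequency).
        Map<String, Integer> counts = new HashMap<>();
        try (Scanner in = new Scanner(new File("input.txt"))) {
            while (in.hasNext()) {
                counts.merge(in.next(), 1, Integer::sum);
            }
        }

        // Step 2: copy the (word, frequency) pairs into a sorted set ordered by
        // frequency, highest first. The tie-break on the word itself is essential:
        // without it a TreeSet would treat equal frequencies as duplicates.
        Comparator<Map.Entry<String, Integer>> byFrequencyDesc = (a, b) -> {
            int cmp = Integer.compare(b.getValue(), a.getValue());
            return cmp != 0 ? cmp : a.getKey().compareTo(b.getKey());
        };
        TreeSet<Map.Entry<String, Integer>> sorted = new TreeSet<>(byFrequencyDesc);
        sorted.addAll(counts.entrySet());

        // Step 3: print the first 1000 entries.
        int remaining = 1000;
        for (Map.Entry<String, Integer> e : sorted) {
            if (remaining-- == 0) break;
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```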

I would like to know a better way of solving this problem.

Thanks

Varun Sharma
  • It is pretty much correct, I would do the same. – SJuan76 Sep 29 '13 at 00:55
  • See: http://stackoverflow.com/questions/185697/the-most-efficient-way-to-find-top-k-frequent-words-in-a-big-word-sequence, most notably their use of a heap. – William Gaul Sep 29 '13 at 01:05
  • Thanks William. That link looks useful. – Varun Sharma Sep 29 '13 at 01:20
  • Have you tried the answer already supplied in the Java Tutorials, [Freq.java](http://docs.oracle.com/javase/tutorial/displayCode.html?code=http://docs.oracle.com/javase/tutorial/collections/interfaces/examples/Freq.java)? A detailed description is given [here](http://docs.oracle.com/javase/tutorial/collections/interfaces/map.html). I hope it's what you want :-) – nIcE cOw Sep 29 '13 at 05:24

2 Answers


Is there a better way of solving this problem?

There are two ways to look at this ... depending on what you mean by "better".

If "better" means "using less time and/or memory", then there are potential ways to address this, depending on the size of the problem and the resources you have available to you. For instance:

  • If the text file (or files) is large enough to justify it, you could consider partitioning the problem to run on multiple processors. But it is not straightforward ... and if the problem turns out to be too small, the overheads of partitioning could actually make the solution slower!

  • The linked Q&A describes a way that potentially speeds up your steps 2 and 3. The catch is that if you benchmark your code, you will probably find that step 1 is where most of the CPU time is spent. (And the larger the input file is, the more pronounced that effect is likely to be ...)

If "better" means "getting the problem solved quicker", then the answer is "Probably No". Assuming that you have a well implemented version of the algorithm you described, the amount of effort in going to a more sophisticated algorithm that is (potentially) faster may not be worth it.

Computer time is cheaper than software developer time!


My advice would be to do the following:

  1. Benchmark your existing program. Does it run fast enough on realistic input files? If "yes", call the program completed.

  2. Profile your existing program running on realistic input files. Does the profiling show any obvious performance hotspots? Are there any opportunities for code tuning? If "yes" try them, and repeat the benchmarking step to see if they actually helped.

  3. Look at the GC logs. Are there indications that your code has excessive GC overheads? If "yes", see if some simple GC tuning (like increasing the heap size) makes a difference.

If you still haven't got acceptable performance after all of the above, now would be the time to start looking for an alternative algorithm.

Stephen C
  • Thanks Stephen. I have tested it on a decent-sized file and it is fast. However, before I tackled this problem I spent half an hour thinking about priority queues, but then I gave up. The above approach is very easy to implement, so for now I will stick with that. And yes, "Computer time is cheaper than software developer time!" is true :-) – Varun Sharma Sep 29 '13 at 02:11
  • @VVV - fast enough is good enough :-) – Stephen C Sep 29 '13 at 02:13
  • Actually that's the fastest I could get so far :-). Thanks for the detailed description. – Varun Sharma Oct 01 '13 at 03:54

Let's focus on steps 2 and 3. Assume there are N words, each with a given frequency, and you have to find the top K (1000 in your case).

Your approach runs in O(N log N). As Stephen mentioned, that is good enough :) Here are some possible improvements anyway.

First approach: As suggested in the link, you can use a min-heap (a priority queue) to maintain the top 1000. It's not that hard to implement. Once you have the map with the frequencies, start iterating over it. While the size of the min-heap is less than K, keep pushing (word, frequency) pairs. Once the heap size reaches K, compare the frequency of the current map entry to the root of the heap (the lowest frequency in the heap). If the new frequency is lower, ignore it; if it is higher, pop the lowest entry from the heap and push the current one. This approach runs in O(N log K).
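
Here is a sketch of how that could look with java.util.PriorityQueue, assuming the frequency map from step 1 already exists. The class and method names (TopKWithHeap, topK) are illustrative, not from the linked answer:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class TopKWithHeap {

    // Returns the k most frequent entries, highest frequency first.
    static List<Map.Entry<String, Integer>> topK(Map<String, Integer> counts, int k) {
        // Min-heap on frequency: the root is always the smallest of the (at most) k kept so far.
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(a.getValue(), b.getValue()));

        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (heap.size() < k) {
                heap.offer(e);                                // still filling up to k elements
            } else if (e.getValue() > heap.peek().getValue()) {
                heap.poll();                                  // drop the current smallest of the top k
                heap.offer(e);                                // keep the better candidate instead
            }
        }

        // Pull the k survivors out (lowest frequency first) and reverse for the final order.
        List<Map.Entry<String, Integer>> result = new ArrayList<>(heap.size());
        while (!heap.isEmpty()) {
            result.add(heap.poll());
        }
        Collections.reverse(result);
        return result;
    }
}
```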

Second approach: A selection algorithm (e.g. quickselect) finds the K-th largest (or smallest) number in an unsorted list in expected O(N). You can find the K-th largest frequency and then iterate over the map, collecting every word whose frequency is greater than or equal to it. If that gives you more than K words, it must be because the K-th largest frequency is repeated a bunch of times; keep track of that and drop the surplus words at that frequency to get exactly the top K. This takes O(N) (plus an O(K log K) sort of the final K entries if the output must be in decreasing order).
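
One way that could look in Java, again assuming the frequency map already exists and that K is no larger than the number of distinct words. The selection step is a plain randomized quickselect; all names here are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class TopKWithSelect {

    // Collects k entries whose frequencies are the k largest (unsorted), assuming k <= counts.size().
    static List<Map.Entry<String, Integer>> topK(Map<String, Integer> counts, int k) {
        int[] freqs = new int[counts.size()];
        int idx = 0;
        for (int f : counts.values()) {
            freqs[idx++] = f;
        }
        int threshold = kthLargest(freqs, k, new Random());

        // One pass for frequencies strictly above the threshold, one pass for the ties,
        // keeping only as many tied entries as are needed to reach exactly k.
        List<Map.Entry<String, Integer>> result = new ArrayList<>(k);
        int slotsLeft = k;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > threshold) {
                result.add(e);
                slotsLeft--;
            }
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() == threshold && slotsLeft-- > 0) {
                result.add(e);
            }
        }
        return result;
    }

    // Quickselect: returns the k-th largest value (1-based) in expected O(N).
    static int kthLargest(int[] a, int k, Random rnd) {
        int lo = 0, hi = a.length - 1, target = k - 1;
        while (true) {
            int p = partitionDesc(a, lo, hi, rnd);
            if (p == target) return a[p];
            if (p < target) lo = p + 1; else hi = p - 1;
        }
    }

    // Lomuto partition around a random pivot, larger values first.
    // (Many duplicate frequencies can degrade this simple partition scheme.)
    static int partitionDesc(int[] a, int lo, int hi, Random rnd) {
        swap(a, lo + rnd.nextInt(hi - lo + 1), hi);
        int pivot = a[hi];
        int i = lo;
        for (int j = lo; j < hi; j++) {
            if (a[j] > pivot) swap(a, i++, j);
        }
        swap(a, i, hi);
        return i;
    }

    static void swap(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }
}
```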

Third approach: You can do a variation of randomized quicksort (a partial sort that only recurses into partitions that can still contain part of the top K) to find the top K. I think the expected running time is O(N) for a fixed K like 1000.
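
A sketch of one such variation: a "partial quicksort" that sorts in descending order of frequency but skips any partition lying entirely beyond the first K positions. The names are illustrative:

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class PartialQuickSort {

    // Rearranges entries so that entries.get(0..k-1) are the k most frequent,
    // in decreasing order of frequency; the rest of the list is left in arbitrary order.
    static void topKDescending(List<Map.Entry<String, Integer>> entries, int k, Random rnd) {
        sortPrefix(entries, 0, entries.size() - 1, k, rnd);
    }

    static void sortPrefix(List<Map.Entry<String, Integer>> a, int lo, int hi, int k, Random rnd) {
        if (lo >= hi) return;
        int p = partitionDesc(a, lo, hi, rnd);
        sortPrefix(a, lo, p - 1, k, rnd);          // the left part always touches the first k slots
        if (p < k - 1) {                           // the right part only matters if it still overlaps them
            sortPrefix(a, p + 1, hi, k, rnd);
        }
    }

    // Lomuto partition around a random pivot, higher frequencies first.
    static int partitionDesc(List<Map.Entry<String, Integer>> a, int lo, int hi, Random rnd) {
        Collections.swap(a, lo + rnd.nextInt(hi - lo + 1), hi);
        int pivot = a.get(hi).getValue();
        int i = lo;
        for (int j = lo; j < hi; j++) {
            if (a.get(j).getValue() > pivot) Collections.swap(a, i++, j);
        }
        Collections.swap(a, i, hi);
        return i;
    }
}
```

Calling topKDescending(new ArrayList<>(counts.entrySet()), 1000, new Random()) would leave the answer in the first 1000 positions of the list.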

M K