0

This question was asked in my interview, and I want to know the solution.

Given a text file that contains one word per line and whose size is more than 1 TB, the task is to print only those words whose frequency in the file is exactly k.

I didn't answer the question fully, but I think I started it in the right way: you have to use a hashing technique, and the code takes at least O(n) time (since it has to read through the whole file).

Can anyone explain how this can be done efficiently?

Learner
  • 962
  • 4
  • 15
  • 30
  • Does the question mention the rough size of the vocabulary? A million words? more? – phs Apr 06 '13 at 08:14

2 Answers

5

In general this class of problems is the topic of "Top K" or "selection" algorithms. Here's a Wikipedia article on the general topic: Wikipedia: Selection algorithm. It seems to have come into vogue with "Big data" systems generally, and perhaps to get past a previous generation of interviews which focused on sorting algorithms long past the time when every serious candidate had memorized quicksort and heapsort code.

As a practical matter this is just about the textbook problem for which "Big Data" (Hadoop, and other Map/Reduce systems) were built. If the data is distributed over N nodes then each can compute separate partial histograms (mapping their histogram function over the entire data set) and merge their results (reducing their subtotals into grand totals).
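
A minimal single-machine sketch of that map/reduce shape (plain Python; the chunk file names stand in for partitions that would live on N different nodes, and k is a placeholder value) might look like this:

```python
# Map step: each "node" computes a partial histogram of its own chunk.
# Reduce step: the partial histograms are merged into grand totals.
# The chunk paths and k below are hypothetical placeholders.
from collections import Counter
from functools import reduce

def partial_histogram(chunk_path):
    counts = Counter()
    with open(chunk_path, "r", encoding="utf-8") as f:
        for line in f:              # one word per line, as in the question
            counts[line.strip()] += 1
    return counts

def merge(histograms):
    return reduce(lambda a, b: a + b, histograms, Counter())

chunks = ["chunk_0.txt", "chunk_1.txt", "chunk_2.txt"]  # placeholder partitions
totals = merge(partial_histogram(c) for c in chunks)
k = 42  # placeholder frequency
print([word for word, count in totals.items() if count == k])
```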

For an interview scenario this is a popular question because there is no simple trick. You could enumerate several published approaches from the academic literature, or you could tackle the problem de novo.

If the "vocabulary" is relatively small (for example there are only a few tens of thousands of words in a typical English lexicon --- so a quarter million word vocabulary is pretty extensive). In that case we expect that the count could fit in RAM for typical modern hardware. If the "words" in this data set are more extensive --- more than tens or hundreds of millions --- then such an approach is no longer feasible.

Conceivably one could try an adaptive or statistical approach. If we know that there are no major clusters of any single "word" ... that any statistically significant sample of the data set is roughly similar to any other ... then we could build our histograms and throw away those "words" (and their counts) which are significantly rarer than the others. If the data will only be presented as a stream, and we are not given any hard guarantee about the distribution of terms, then this is not a feasible approach. But if we have the data in some random-access filesystem, then we could sample the data set sparsely and randomly to build a very likely set of the top K * M (where M is some arbitrary multiple of our desired K elements, chosen so that it will all fit in RAM).
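
One way to sketch that sparse, random sampling over a random-access file is shown below (the sample size, file name, and K * M cutoff are arbitrary placeholders; note that seeking to a random byte and skipping to the next full line slightly biases the sample toward words that follow long lines):

```python
# Rough sketch of sparse random sampling over a random-access file; the
# file name, sample size, K and M are hypothetical placeholders.
import os
import random
from collections import Counter

def sample_histogram(path, samples=100_000):
    size = os.path.getsize(path)
    counts = Counter()
    with open(path, "rb") as f:
        for _ in range(samples):
            f.seek(random.randrange(size))
            f.readline()                 # discard the partial line we landed in
            word = f.readline().strip()  # next complete line is our sample
            if word:
                counts[word.decode("utf-8", errors="ignore")] += 1
    return counts

# Keep only the K * M most common sampled words as candidates that fit in RAM.
K, M = 10, 100  # placeholder values
candidates = [w for w, _ in sample_histogram("words.txt").most_common(K * M)]
```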

Hashing might help us locate our counters for each word, but we have to consider the chances of collisions if we try to keep counts of only the hashes without keeping the "words" themselves in the data structure. In general I'd think a heap would be better (possibly including dropping things from the bottom of the in-memory heap out into a storage heap or trie).

I said "adaptive" earlier because one could use caching (ultimately statistical modeling) to keep the currently most frequent "words" in RAM and shuffle the least frequent out to storage (to protect against some degenerate data set where the initially frequent "words" give way to some initially rare word which becomes more frequent as one digs deeper through the data set).

While a conversational exploration of these considerations might work well in some interviews, I'd suggest being familiar with the various sections of the Wikipedia article I've cited so that you can sketch out the pseudo-code for at least one or two of those approaches and show that you have some academic background in this material.

Absolutely do not neglect to discuss distributed processing in any interview where a "Top K" class of question is being posed. Do so even if only to clarify the constraints of the question being posed and to acknowledge that such problems have been the driving force behind modern "big data" distributed processing systems.

Also here's a question on the same general topic: StackOverflow: The Most Efficient Way To Find Top K Frequent Words In A Big Word Sequence.

Jim Dennis
  • 17,054
  • 13
  • 68
  • 116
0

The answer to this question depends entirely on the number of unique words. If the unique word count is small, then you can use any string-to-number mapping data structure (e.g. a trie) to count word frequencies; a minimal sketch of that case follows below. The complexity will be n log(m) (m being the length of an individual word), and it is easy to implement. But from the way the problem was described, the unique word count is most probably too big to store in memory. In that case the approach described after the sketch can be used:
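
Here is a minimal sketch of that small-vocabulary case using a character trie (the file name and k are placeholders; any other string-to-number map would do just as well):

```python
# Trie-based word counter for the case where the vocabulary fits in RAM.
# The file name and k are hypothetical placeholders.
class TrieNode:
    __slots__ = ("children", "count")
    def __init__(self):
        self.children = {}
        self.count = 0

def count_words(path):
    root = TrieNode()
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            node = root
            for ch in line.strip():
                node = node.children.setdefault(ch, TrieNode())
            node.count += 1          # count stored at the word's final node
    return root

def print_frequency_k(node, k, prefix=""):
    if node.count == k:
        print(prefix)
    for ch, child in node.children.items():
        print_frequency_k(child, k, prefix + ch)

print_frequency_k(count_words("words.txt"), 42)  # hypothetical file and k
```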

1 TB of data means there are about 1.0*10^12 bytes in the input file. One byte is one character, and let's say an average word has 4 characters; then we have about 2.5*10^11 words. We will divide this word list into 50k different word lists: each time, read about 5m unread words from the input file, sort that 5m-word list, and write the sorted list out to a file. We will use a 50k-element array of numbers (let's call it Parray) to store the starting location of each sorted list in the file (initially Parray will hold offsets like 0, 5m+1, 10m+1, etc.).

Now read the first word from each of the 50k lists and put those words in a min-heap; the smallest word will be on top of the heap. After getting the current smallest word (let's call it cur_small), read that word out of every list in which it appears (after this operation Parray will point to the next smallest word in each of those lists). This gives you the count of cur_small, so take the decision based on K, then remove all entries of cur_small from the heap, and finally add a new word to the heap from each list from which at least one cur_small was removed. Continue this process until all the sorted lists have been read out. The overall complexity is n log(n).
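
Here is a minimal sketch of that external sort and merge in Python, leaning on heapq.merge for the min-heap k-way merge and itertools.groupby to count runs of identical words (the run size, file name, and k are placeholders, not part of the description above):

```python
# External sort: write sorted runs, then k-way merge them and count runs
# of identical words.  File name, run size, and k are placeholders.
import heapq
import itertools
import os
import tempfile

CHUNK_WORDS = 5_000_000  # ~5m words per sorted run, as described above

def make_sorted_runs(path):
    """Split the input into sorted run files and return their paths."""
    run_paths = []
    with open(path, "r", encoding="utf-8") as f:
        while True:
            chunk = [line.rstrip("\n") for line in itertools.islice(f, CHUNK_WORDS)]
            if not chunk:
                break
            chunk.sort()
            fd, run_path = tempfile.mkstemp(suffix=".run")
            with os.fdopen(fd, "w", encoding="utf-8") as run:
                run.write("\n".join(chunk) + "\n")
            run_paths.append(run_path)
    return run_paths

def print_words_with_frequency_k(path, k):
    runs = [open(p, "r", encoding="utf-8") for p in make_sorted_runs(path)]
    try:
        # heapq.merge does the k-way merge built by hand with a min-heap
        # above; groupby then counts consecutive identical words.
        merged = heapq.merge(*runs)
        for word, group in itertools.groupby(merged):
            if sum(1 for _ in group) == k:
                print(word.rstrip("\n"))
    finally:
        for r in runs:
            r.close()

print_words_with_frequency_k("words.txt", 42)  # hypothetical file and k
```

In practice the run size would be chosen to match the available RAM, and, as the comments below point out, the wall-clock time will be dominated by the I/O of writing and reading the runs.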

sowrov
  • 1,018
  • 10
  • 16
  • While the overall complexity is n * log(n) (because you are essentially just doing a huge sort), the practical performance will be dominated by I/O (because of the non-computational overhead of writing and reading all of these segment files). In an interview I highly recommend maintaining a clear awareness of the difference between complexity analysis and practical performance implications. A very elegant academic solution may be outperformed by a hack that avoids some real-world capacity limitations. On the other hand, consider the implications of SSDs and filesystems tuned for them. – Jim Dennis Apr 07 '13 at 03:22
  • Tuning and tweaking can be done at any time, as long as you have a proper algorithm to perform the actual job. So when selecting an algorithm, I think all you have to be concerned with is the complexity, because an `n log n` algorithm will always outperform an `n^2` one. – sowrov Apr 07 '13 at 05:31
  • Any "Top K" algorithm can be reduced to sorting --- so n * log(n) should be considered the upper bound of any feasible approach. The question then become a matter of whether you can improve on sorting for any given specific case -- such as when K is significantly smaller than the number of distinct/countable "words" in the data set. – Jim Dennis Apr 07 '13 at 06:42