5

I came across a problem where we have to find, say, the 10 most frequent words in a terabyte-sized file or string.

One solution I could think of was using a hash table (word, count) along with a max-heap. But fitting all the words in memory when most of them are unique might cause a problem. Another solution I thought of uses MapReduce, splitting the chunks across different nodes. A third solution would be to build a Trie of all the words and update the count of each word as we scan through the file or string.

Which one of the above would be a better solution? I think the first solution is pretty naive.

Akshay
  • Do you always need the optimal answer, or are you interested in some kind of heuristic algorithm? – Ari Sep 21 '12 at 07:03
  • Actually I am looking for an optimal algorithm which would minimize the time and space complexity as much as possible, considering that the entire data cannot be read into memory. – Akshay Sep 21 '12 at 07:11
  • possible duplicate of [Parsing one terabyte of text and efficiently counting the number of occurrences of each word](http://stackoverflow.com/questions/12190326/parsing-one-terabyte-of-text-and-efficiently-counting-the-number-of-occurrences) – amit Sep 21 '12 at 09:35
  • The mentioned question had a bounty, so there is a large variety of answers, including: (1) an extended discussion of how to answer it in an interview, (2) a great problem analysis, (3) a map-reduce solution, (4) a mapping solution, (5) an SQL-based solution, and more. – amit Sep 21 '12 at 09:40

4 Answers

7

Split your available memory into two halves. Use one half as a 4-bit counting Bloom filter and the other half as a fixed-size hash table with counts. The role of the counting Bloom filter is to filter out rarely occurring words with high memory efficiency.

Check your 1 TB of words against the initially empty Bloom filter; if a word is already in it and all of its buckets are set to the maximum value of 15 (this may be partly or wholly a false positive), pass it through. If it is not, add it.

Words that pass through get counted; for the majority of words, this happens every time except the first 15 times you see them. A small percentage will start to get counted even sooner, bringing a potential inaccuracy of up to 15 occurrences per word into your results. That's a limitation of Bloom filters.
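Here is a minimal Python sketch of that first pass, purely as an illustration: the filter size, the number of hash functions, and the `words.txt` path are made-up parameters, and each 4-bit bucket is stored in a whole byte for simplicity rather than packed two per byte.

```python
import hashlib
from collections import Counter

NUM_BUCKETS = 8 * 1024 * 1024   # hypothetical size; really sized to half your RAM
NUM_HASHES = 3                  # hash functions per word (assumed)
MAX_BUCKET = 15                 # 4-bit buckets saturate at 2**4 - 1

bloom = bytearray(NUM_BUCKETS)  # the counting Bloom filter (one byte per bucket here)
counts = Counter()              # the hash table of "approximate winners"

def bucket_indices(word):
    """Derive NUM_HASHES bucket indices from a single digest of the word."""
    digest = hashlib.sha1(word.encode("utf-8")).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % NUM_BUCKETS
            for i in range(NUM_HASHES)]

def see(word):
    """Run one word through the counting Bloom filter."""
    idx = bucket_indices(word)
    if all(bloom[i] >= MAX_BUCKET for i in idx):
        # All buckets saturated: the word has (probably) been seen 15+ times,
        # so it passes through and is counted exactly from now on.
        counts[word] += 1
    else:
        # Not yet saturated: just bump the buckets, capped at 15.
        for i in idx:
            if bloom[i] < MAX_BUCKET:
                bloom[i] += 1

with open("words.txt") as f:        # illustrative path; really a 1 TB stream
    for line in f:
        for word in line.split():
            see(word)

print(counts.most_common(10))       # approximate top 10 after the first pass
```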

When the first pass is over, you can correct the inaccuracy with a second pass if desired. Deallocate the Bloom filter, and also discard all counts that are not within 15 occurrences of the tenth most frequent word. Go through the input again, this time counting words accurately (using a separate hash table), but ignoring words that were not retained as approximate winners from the first pass.
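Continuing the sketch above (same assumed file path, and reusing its `counts` and `MAX_BUCKET`), the optional second pass might look like this:

```python
# Keep only candidates whose approximate count is within 15 of the 10th best.
tenth_best = counts.most_common(10)[-1][1]
candidates = {w for w, c in counts.items() if c >= tenth_best - MAX_BUCKET}

exact = Counter()                   # fresh hash table for exact counts
with open("words.txt") as f:
    for line in f:
        for word in line.split():
            if word in candidates:  # ignore everything that was filtered out
                exact[word] += 1

print(exact.most_common(10))        # exact top 10
```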

Notes

The hash table used in the first pass may theoretically overflow with certain statistical distributions of the input (e.g., each word exactly 16 times) or with extremely limited RAM. It is up to you to calculate or try out whether this can realistically happen to you or not.

Note also that the bucket width (4 bits in the above description) is just a parameter of the construction. A non-counting Bloom filter (bucket width of 1) would filter out most unique words nicely, but do nothing to filter out other very rarely occurring words. A wider bucket size will be more prone to cross-talk between words (because there will be fewer buckets), and it will also reduce the guaranteed accuracy level after the first pass (15 occurrences in the case of 4 bits). But these downsides will be quantitatively insignificant up to some point, while I consider the more aggressive filtering effect completely crucial for keeping the hash table at sub-gigabyte sizes with non-repetitive natural language data.

As for the order-of-magnitude memory needs of the Bloom filter itself: these people are working well below 100 MB, and with a much more challenging application ("full" n-gram statistics, rather than threshold 1-gram statistics).
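As a rough back-of-the-envelope illustration of the sizing (my own assumed numbers, not figures from the answer or the work it refers to):

```python
# Hypothetical sizing: devote 0.5 GB of the RAM half to the 4-bit Bloom filter.
bloom_bytes = 512 * 1024 * 1024
buckets = bloom_bytes * 8 // 4            # two 4-bit buckets per byte -> ~1.07e9 buckets
distinct_words = 50_000_000               # assumed vocabulary size of the 1 TB input
num_hashes = 3
load = distinct_words * num_hashes / buckets
print(buckets, round(load, 2))            # a low load factor keeps cross-talk rare
```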

Jirka Hanika
  • I got the concept as a whole, but can you please briefly explain paragraph 2? I am unclear about why we need to check whether the buckets are at the maximum value of 15, and also about why only a 4-bit counting Bloom filter is required. – Akshay Sep 22 '12 at 00:06
  • @Akshay - I added some discussion of that. If I knew for sure that your 1 TB is English prose, I would venture to suggest 10-bit counting. Too-wide buckets may however lead to a problem not mentioned in the answer: no words will pass at all if you make the buckets, say, 40 bits wide. – Jirka Hanika Sep 22 '12 at 20:25
3

Sort the terabyte file alphabetically using mergesort. In the initial pass, use quicksort with all available physical RAM to pre-sort long runs of words.

When doing so, represent each contiguous run of identical words by just one such word and a count. (That is, you are adding the counts during the merges.)

Then re-sort the file, again using mergesort with quicksort pre-sorting, but this time by the counts rather than alphabetically.
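A small Python sketch of the count-merging step, using `heapq.merge` and `itertools.groupby` on toy in-memory runs (a real implementation would stream the pre-sorted runs from disk and write the merged output back out):

```python
import heapq
import itertools

def merge_runs(sorted_runs):
    """Merge alphabetically sorted (word, count) runs, adding counts of equal words."""
    merged = heapq.merge(*sorted_runs)  # streams stay sorted by word
    for word, group in itertools.groupby(merged, key=lambda wc: wc[0]):
        yield word, sum(count for _, count in group)

# Toy runs, as they might come out of the in-RAM quicksort pre-sorting:
run_a = [("apple", 2), ("cat", 1), ("dog", 3)]
run_b = [("apple", 1), ("dog", 2), ("zebra", 1)]

totals = list(merge_runs([run_a, run_b]))
# -> [('apple', 3), ('cat', 1), ('dog', 5), ('zebra', 1)]

# Final step: re-sort by count (descending) and take the top 10.
top10 = sorted(totals, key=lambda wc: wc[1], reverse=True)[:10]
print(top10)
```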

This is slower but simpler to implement than my other answer.

Jirka Hanika
2

The best I could think of:

  1. Split the data into parts you can store in memory.
  2. For each part, get the N most frequent words; you will get N * partsNumber candidate words in total.
  3. Read all the data again, counting only the words you got before.

It won't always give you the correct answer, but it will work in fixed memory and linear time.
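A toy Python sketch of this two-pass idea, with N and the in-memory parts hard-coded for illustration (a real implementation would split and stream the terabyte file):

```python
from collections import Counter

N = 10  # how many candidates to keep per part (assumed)

def top_candidates(parts):
    """Pass 1: collect the N most frequent words from each in-memory part."""
    candidates = set()
    for part in parts:  # each part is a list of words that fits in RAM
        candidates.update(w for w, _ in Counter(part).most_common(N))
    return candidates

def recount(parts, candidates):
    """Pass 2: re-read all the data, counting only the candidate words."""
    totals = Counter()
    for part in parts:
        totals.update(w for w in part if w in candidates)
    return totals.most_common(N)

# Toy data standing in for the split terabyte:
parts = [["a", "a", "b"], ["b", "b", "c"], ["a", "c", "c"]]
print(recount(parts, top_candidates(parts)))
```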

Ari
  • Yup, seems like a better solution. Time O(n) and space O(N * partsNumber). – Akshay Sep 21 '12 at 08:23
  • Like you stated, it doesn't always give you the correct answer, so I don't see how it is a correct algorithm. – Yarneo Sep 21 '12 at 09:05
  • @Yarneo What do you mean by "correct algorithm"? I think an algorithm which gives a good answer (in most cases the best, sometimes near the best) is better than an algorithm which always gives the best answer but needs TBs of memory and years of time. – Ari Sep 21 '12 at 09:15
  • I'm sorry about the overly "head to the wall" answer; you are right about that. What I meant to say is that, at least for this problem, which is a classic map-reduce problem, you could get a precise answer with good time complexity. – Yarneo Sep 21 '12 at 09:20
  • @Yarneo You are 100% right *if* there is a fast and correct (best-answer) algorithm. I don't know of such an algorithm, but you say there is one, so my answer isn't good. – Ari Sep 21 '12 at 09:42
0

And why do you think building a Trie structure is not the best decision? Mark each node with a counter and that's it! The maximum memory complexity will be O(26 * longest_word_length), and the time complexity should be O(n); that's not bad, is it?
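For illustration, a minimal Python sketch of such a counting trie (lowercase words assumed; the top-10 extraction at the end is my addition, not part of the answer above):

```python
class TrieNode:
    __slots__ = ("children", "count")
    def __init__(self):
        self.children = {}  # char -> TrieNode
        self.count = 0      # how many words end at this node

def insert(root, word):
    """Walk/extend the trie along the word and bump the terminal counter."""
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.count += 1

def top_k(root, k=10):
    """Depth-first walk collecting the k most frequent words."""
    results = []
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        if node.count:
            results.append((node.count, prefix))
        for ch, child in node.children.items():
            stack.append((child, prefix + ch))
    return sorted(results, reverse=True)[:k]

root = TrieNode()
for w in ["to", "be", "or", "not", "to", "be"]:
    insert(root, w)
print(top_k(root))  # [(2, 'to'), (2, 'be'), (1, 'or'), (1, 'not')]
```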