5

I have a huge text file (larger than the available RAM). I need to count the frequency of all words and output each word along with its frequency count into a new file. The result should be sorted in descending order of frequency count.

My Approach:

  1. Sort the given file - external sort
  2. Count the frequency of each word sequentially, and store the word and its count in another file (see the sketch below)
  3. Sort the output file based on the frequency count - external sort.
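
For step 2, this is roughly what I have in mind (a minimal sketch, assuming the externally sorted file from step 1 has one word per line; the file names are just placeholders):

```java
import java.io.*;

// Counts consecutive identical words in an already-sorted file (one word per line)
// and writes "word<TAB>count" lines to the output file.
public class CountSortedWords {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader("words-sorted.txt"));
             PrintWriter out = new PrintWriter(new FileWriter("word-counts.txt"))) {

            String previous = null;
            long count = 0;
            String word;
            while ((word = in.readLine()) != null) {
                if (word.equals(previous)) {
                    count++;                                   // same run, keep counting
                } else {
                    if (previous != null) {
                        out.println(previous + "\t" + count);  // flush the finished run
                    }
                    previous = word;
                    count = 1;
                }
            }
            if (previous != null) {
                out.println(previous + "\t" + count);          // flush the last run
            }
        }
    }
}
```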

I want to know if there are better approaches to do it. I have heard of disk based hash tables? or B+ trees, but never tried them before.

Note: I have seen similar questions asked on SO, but none of them address the issue of data larger than memory.

Edit: Based on the comments, I agree that a dictionary of words should, in practice, fit in the memory of today's computers. But let's take a hypothetical dictionary of words that is huge enough not to fit in memory.

vikky.rk

4 Answers

14

I would go with a map-reduce approach:

  1. Distribute your text file across nodes, assuming the text on each node fits into RAM.
  2. Calculate each word's frequency within the node, using a hash table (sketched below).
  3. Collect the results on a master node and combine them all.
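
A rough sketch of step 2 on a single node, assuming whitespace/punctuation-delimited words; the chunk and partial-file names are placeholders, and the result is written sorted by word so the partial files can be merged later (see the comments below):

```java
import java.io.*;
import java.util.*;

// Step 2 on one node: count word frequencies of a chunk that fits in RAM with a HashMap,
// then write "word<TAB>count" lines sorted by word so partial results can be merged later.
public class NodeWordCount {
    public static void main(String[] args) throws IOException {
        Map<String, Long> counts = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader("chunk-0.txt"))) {   // placeholder name
            String line;
            while ((line = in.readLine()) != null) {
                for (String word : line.toLowerCase().split("\\W+")) {
                    if (!word.isEmpty()) {
                        counts.merge(word, 1L, Long::sum);    // new word -> 1, existing word -> +1
                    }
                }
            }
        }
        try (PrintWriter out = new PrintWriter(new FileWriter("part-0.txt"))) {         // placeholder name
            counts.entrySet().stream()
                  .sorted(Map.Entry.comparingByKey())
                  .forEach(e -> out.println(e.getKey() + "\t" + e.getValue()));
        }
    }
}
```
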
ogzd
  • Since the poster claims not even a dictionary of the words used in the file fits into his tiny RAM (???), I vote +1 on THIS solution - and this also works on a single machine when you process the slices sequentially. – Mörre Feb 07 '13 at 08:21
  • I thought about this approach sequentially, but how would I combine the results efficiently? – vikky.rk Feb 07 '13 at 08:26
  • Sort each result file individually, then open them all and read them line by line, adding the counts when two files have the same word and otherwise writing whichever word/count pair comes first alphabetically to the result file. – Mörre Feb 07 '13 at 08:28
  • Yes, that is almost what external sort does, except that we don't need to sort the entire file; sorting the slices should be enough. – vikky.rk Feb 07 '13 at 08:42
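
A minimal sketch of the merge Mörre describes, assuming each partial file is sorted by word and holds `word<TAB>count` lines (the file names are placeholders):

```java
import java.io.*;
import java.util.*;

// Merges several partial count files (each sorted by word, lines of "word<TAB>count")
// into one combined file, summing the counts of identical words.
// A PriorityQueue keeps the alphabetically smallest current word across all files on top.
public class MergePartialCounts {
    static class Entry implements Comparable<Entry> {
        final String word;
        final long count;
        final BufferedReader source;
        Entry(String word, long count, BufferedReader source) {
            this.word = word; this.count = count; this.source = source;
        }
        public int compareTo(Entry other) { return word.compareTo(other.word); }
    }

    public static void main(String[] args) throws IOException {
        String[] partials = {"part-0.txt", "part-1.txt", "part-2.txt"};   // placeholder names
        PriorityQueue<Entry> heap = new PriorityQueue<>();
        List<BufferedReader> readers = new ArrayList<>();
        for (String name : partials) {
            BufferedReader r = new BufferedReader(new FileReader(name));
            readers.add(r);
            push(heap, r);                                    // prime the heap with each file's first line
        }
        try (PrintWriter out = new PrintWriter(new FileWriter("combined.txt"))) {
            String currentWord = null;
            long currentCount = 0;
            while (!heap.isEmpty()) {
                Entry e = heap.poll();
                if (e.word.equals(currentWord)) {
                    currentCount += e.count;                  // same word coming from another file
                } else {
                    if (currentWord != null) out.println(currentWord + "\t" + currentCount);
                    currentWord = e.word;
                    currentCount = e.count;
                }
                push(heap, e.source);                         // advance the file we just consumed from
            }
            if (currentWord != null) out.println(currentWord + "\t" + currentCount);
        }
        for (BufferedReader r : readers) r.close();
    }

    static void push(PriorityQueue<Entry> heap, BufferedReader r) throws IOException {
        String line = r.readLine();
        if (line != null) {
            String[] parts = line.split("\t");
            heap.add(new Entry(parts[0], Long.parseLong(parts[1]), r));
        }
    }
}
```
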
4

All unique words probably fit in memory so I'd use this approach:

  • Create a dictionary (HashMap<string, int>).
  • Read the huge text file line by line.
  • Add new words to the dictionary and set their value to 1.
  • Add 1 to the value of existing words.

After you've parsed the entire huge file:

  • Sort the dictionary by frequency.
  • Write, to a new file, the sorted dictionary with words and frequency.

Remember, though, to convert the words to either lowercase or uppercase.
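
A minimal sketch of this approach, assuming Java 8+ and treating anything that isn't a letter or digit as a separator (file names are placeholders):

```java
import java.io.*;
import java.util.*;

// Count every word of the huge file in a HashMap in one pass (line by line),
// then sort the entries by frequency (descending) and write them out.
public class WordFrequency {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> dictionary = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader("huge.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                for (String word : line.toLowerCase().split("\\W+")) {   // lowercase, as suggested
                    if (!word.isEmpty()) {
                        dictionary.merge(word, 1, Integer::sum);         // new word -> 1, existing -> +1
                    }
                }
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(dictionary.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        try (PrintWriter out = new PrintWriter(new FileWriter("frequencies.txt"))) {
            for (Map.Entry<String, Integer> e : entries) {
                out.println(e.getKey() + "\t" + e.getValue());
            }
        }
    }
}
```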

Sani Huttunen
  • nice approach. but would you sort the dictionary between every word? does that result in quicker search for future words? – Fredrik Feb 07 '13 at 08:21
  • no... Sort the dictionary after all words are added. – Sani Huttunen Feb 07 '13 at 08:22
  • Why a `Dictionary`? The class is marked as obsolete. – Matteo Feb 07 '13 at 08:23
  • @Matteo: I'm not suggesting to use the `Dictionary` class. Other than being obsolete it's also an abstract class and would be of no use. The choice of the word `dictionary` is based on what the `HashMap` is used for. – Sani Huttunen Feb 07 '13 at 08:26
  • Assuming the majority of words are not duplicate. Will this approach work fine while reading the content of file size 1 Pebibyte (PiB)? – hemanto Dec 04 '17 at 06:41
  • "probably fit in memory" bad assumption. In 2020 with 16GB or ram, I can't do this in memory (why I'm here). – ldmtwo May 19 '20 at 01:46
  • @ldmtwo: You are not thinking correctly. Let's say the average word length is 8 letters. For simplicity, say 16 GB is 16 000 000 000 bytes. That would mean you could fit 2 000 000 000 **unique** words in memory. Do you know of a language with that many **unique** words? Even if the average word length were 16, you'd still fit 1 000 000 000 **unique** words in memory. In comparison, it has been estimated that the English language consists of roughly 1 000 000 **unique** words, and I highly doubt all of those are used in a written text in a file (unless it's a complete word list). – Sani Huttunen May 19 '20 at 06:22
3

The best way to achieve this would be to read the file line by line and store the words in a Multimap (e.g. Guava's). If this map exceeds your memory, you could try using a key-value store (e.g. Berkeley JE DB or MapDB). These key-value stores work similarly to a map, but they store their values on the HDD. I used MapDB for a similar problem and it was blazing fast.
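
A rough sketch of the counting loop against MapDB, assuming version 3.x's `DBMaker`/`HTreeMap` API (the file and collection names are placeholders; check the MapDB docs for the exact builder methods):

```java
import org.mapdb.*;     // assumes MapDB 3.x on the classpath
import java.io.*;

// Same counting loop as the in-memory approach, but the map is backed by a file,
// so the dictionary is not limited by RAM. File and collection names are placeholders.
public class MapDbWordCount {
    public static void main(String[] args) throws IOException {
        DB db = DBMaker.fileDB("wordcounts.db").make();
        HTreeMap<String, Integer> counts = db
                .hashMap("counts", Serializer.STRING, Serializer.INTEGER)
                .createOrOpen();
        try (BufferedReader in = new BufferedReader(new FileReader("huge.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                for (String word : line.toLowerCase().split("\\W+")) {
                    if (!word.isEmpty()) {
                        counts.merge(word, 1, Integer::sum);   // stored on disk, not in RAM
                    }
                }
            }
        }
        db.close();
        // The entries can then be exported and sorted by count (e.g. with an external sort).
    }
}
```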

pushy
1

If the list of unique words and their frequencies fits in memory (not the whole file, just the unique words), you can use a hash table and read the file sequentially (without storing it).

You can then sort the entries of the hash table by the number of occurrences.
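
A small sketch of that sorting step, assuming the counts already sit in a `HashMap<String, Integer>`:

```java
import java.util.*;

// Turn the hash table of counts into a list of entries sorted by number of occurrences, highest first.
public class SortByOccurrences {
    static List<Map.Entry<String, Integer>> byCount(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return entries;
    }
}
```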

Matteo