5

I have a huge text file (larger than the available RAM). I need to count the frequency of all words and output each word along with its frequency count into a new file. The result should be sorted in descending order of frequency count.

My Approach:

  1. Sort the given file - external sort
  2. Count the frequency of each word sequentially, and store the word and its count in another file (see the sketch below)
  3. Sort the output file based on the frequency count - external sort.
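
For step 2, this is roughly what I have in mind (a minimal sketch, assuming the externally sorted file from step 1 has one word per line; the file names are just placeholders):

```java
import java.io.*;

// Counts consecutive identical words in an already-sorted file (one word per line)
// and writes "word<TAB>count" lines to the output file.
public class CountSortedWords {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader("words-sorted.txt"));
             PrintWriter out = new PrintWriter(new FileWriter("word-counts.txt"))) {

            String previous = null;
            long count = 0;
            String word;
            while ((word = in.readLine()) != null) {
                if (word.equals(previous)) {
                    count++;                                   // same run, keep counting
                } else {
                    if (previous != null) {
                        out.println(previous + "\t" + count);  // flush the finished run
                    }
                    previous = word;
                    count = 1;
                }
            }
            if (previous != null) {
                out.println(previous + "\t" + count);          // flush the last run
            }
        }
    }
}
```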

I want to know if there are better approaches to do it. I have heard of disk based hash tables? or B+ trees, but never tried them before.

Note: I have seen similar questions asked on SO, but none of them address the issue of data larger than memory.

Edit: Based on the comments, I agree that a dictionary of words should, in practice, fit in the memory of today's computers. But let's take a hypothetical dictionary of words that is huge enough not to fit in memory.

vikky.rk

4 Answers

14

I would go with a map-reduce approach:

  1. Distribute your text file across nodes, assuming the text on each node fits into RAM.
  2. Calculate each word's frequency within the node, using a hash table (sketched below).
  3. Collect the results on a master node and combine them all.
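
A rough sketch of step 2 on a single node, assuming whitespace/punctuation-delimited words; the chunk and partial-file names are placeholders, and the result is written sorted by word so the partial files can be merged later (see the comments below):

```java
import java.io.*;
import java.util.*;

// Step 2 on one node: count word frequencies of a chunk that fits in RAM with a HashMap,
// then write "word<TAB>count" lines sorted by word so partial results can be merged later.
public class NodeWordCount {
    public static void main(String[] args) throws IOException {
        Map<String, Long> counts = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader("chunk-0.txt"))) {   // placeholder name
            String line;
            while ((line = in.readLine()) != null) {
                for (String word : line.toLowerCase().split("\\W+")) {
                    if (!word.isEmpty()) {
                        counts.merge(word, 1L, Long::sum);    // new word -> 1, existing word -> +1
                    }
                }
            }
        }
        try (PrintWriter out = new PrintWriter(new FileWriter("part-0.txt"))) {         // placeholder name
            counts.entrySet().stream()
                  .sorted(Map.Entry.comparingByKey())
                  .forEach(e -> out.println(e.getKey() + "\t" + e.getValue()));
        }
    }
}
```
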
ogzd
  • Since the poster claims not even a dictionary of the words used in the file fits into his tiny RAM (???), I vote +1 on THIS solution - and this also works on a single machine when you process the slices sequentially. – Mörre Feb 07 '13 at 08:21
  • I thought about this approach sequentially, but how would I combine the results efficiently? – vikky.rk Feb 07 '13 at 08:26
  • Sort each result file individually, then open them all and read them line by line, adding the counts when two files have the same word and otherwise writing whichever word/count pair comes first alphabetically to the result file. – Mörre Feb 07 '13 at 08:28
  • Yes, that is almost what external sort does, except that we don't need to sort the entire file; sorting the slices should be enough. – vikky.rk Feb 07 '13 at 08:42
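
A minimal sketch of the merge Mörre describes, assuming each partial file is sorted by word and holds `word<TAB>count` lines (the file names are placeholders):

```java
import java.io.*;
import java.util.*;

// Merges several partial count files (each sorted by word, lines of "word<TAB>count")
// into one combined file, summing the counts of identical words.
// A PriorityQueue keeps the alphabetically smallest current word across all files on top.
public class MergePartialCounts {
    static class Entry implements Comparable<Entry> {
        final String word;
        final long count;
        final BufferedReader source;
        Entry(String word, long count, BufferedReader source) {
            this.word = word; this.count = count; this.source = source;
        }
        public int compareTo(Entry other) { return word.compareTo(other.word); }
    }

    public static void main(String[] args) throws IOException {
        String[] partials = {"part-0.txt", "part-1.txt", "part-2.txt"};   // placeholder names
        PriorityQueue<Entry> heap = new PriorityQueue<>();
        List<BufferedReader> readers = new ArrayList<>();
        for (String name : partials) {
            BufferedReader r = new BufferedReader(new FileReader(name));
            readers.add(r);
            push(heap, r);                                    // prime the heap with each file's first line
        }
        try (PrintWriter out = new PrintWriter(new FileWriter("combined.txt"))) {
            String currentWord = null;
            long currentCount = 0;
            while (!heap.isEmpty()) {
                Entry e = heap.poll();
                if (e.word.equals(currentWord)) {
                    currentCount += e.count;                  // same word coming from another file
                } else {
                    if (currentWord != null) out.println(currentWord + "\t" + currentCount);
                    currentWord = e.word;
                    currentCount = e.count;
                }
                push(heap, e.source);                         // advance the file we just consumed from
            }
            if (currentWord != null) out.println(currentWord + "\t" + currentCount);
        }
        for (BufferedReader r : readers) r.close();
    }

    static void push(PriorityQueue<Entry> heap, BufferedReader r) throws IOException {
        String line = r.readLine();
        if (line != null) {
            String[] parts = line.split("\t");
            heap.add(new Entry(parts[0], Long.parseLong(parts[1]), r));
        }
    }
}
```
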
4

All unique words probably fit in memory so I'd use this approach:

  • Create a dictionary (HashMap<string, int>).
  • Read the huge text file line by line.
  • Add new words to the dictionary and set their value to 1.
  • Add 1 to the value of existing words.

After you've parsed the entire huge file:

  • Sort the dictionary by frequency.
  • Write, to a new file, the sorted dictionary with words and frequency.

Remember, though, to convert the words to either lowercase or uppercase.
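
A minimal sketch of this approach, assuming Java 8+ and treating anything that isn't a letter or digit as a separator (file names are placeholders):

```java
import java.io.*;
import java.util.*;

// Count every word of the huge file in a HashMap in one pass (line by line),
// then sort the entries by frequency (descending) and write them out.
public class WordFrequency {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> dictionary = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader("huge.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                for (String word : line.toLowerCase().split("\\W+")) {   // lowercase, as suggested
                    if (!word.isEmpty()) {
                        dictionary.merge(word, 1, Integer::sum);         // new word -> 1, existing -> +1
                    }
                }
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(dictionary.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        try (PrintWriter out = new PrintWriter(new FileWriter("frequencies.txt"))) {
            for (Map.Entry<String, Integer> e : entries) {
                out.println(e.getKey() + "\t" + e.getValue());
            }
        }
    }
}
```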

Sani Huttunen
  • nice approach. but would you sort the dictionary between every word? does that result in quicker search for future words? – Fredrik Feb 07 '13 at 08:21
  • no... Sort the dictionary after all words are added. – Sani Huttunen Feb 07 '13 at 08:22
  • Why a `Dictionary`? The class is marked as obsolete. – Matteo Feb 07 '13 at 08:23
  • @Matteo: I'm not suggesting to use the `Dictionary` class. Other than being obsolete it's also an abstract class and would be of no use. The choice of the word `dictionary` is based on what the `HashMap` is used for. – Sani Huttunen Feb 07 '13 at 08:26
  • Assuming the majority of words are not duplicate. Will this approach work fine while reading the content of file size 1 Pebibyte (PiB)? – hemanto Dec 04 '17 at 06:41
  • "probably fit in memory" bad assumption. In 2020 with 16GB or ram, I can't do this in memory (why I'm here). – ldmtwo May 19 '20 at 01:46
  • @ldmtwo: You are not thinking correctly. Let's say the average word length is 8 letters. For simplicity, say 16 GB is 16 000 000 000 bytes. That would mean you could fit 2 000 000 000 **unique** words in memory. Do you know of a language with that many **unique** words? Even if the average word length were 16, you'd still fit 1 000 000 000 **unique** words in memory. In comparison, it has been estimated that the English language consists of roughly 1 000 000 **unique** words, and I highly doubt all of those are used in a written text in a file (unless it's a complete word list). – Sani Huttunen May 19 '20 at 06:22
3

The best way to achieve this would be to read the file line by line and store the words in a Multimap (e.g. Guava's). If this map exceeds your memory, you could try using a key-value store (e.g. Berkeley JE DB or MapDB). These key-value stores work similarly to a map, but they store their values on the HDD. I used MapDB for a similar problem and it was blazing fast.
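
A rough sketch of the counting loop against MapDB, assuming version 3.x's `DBMaker`/`HTreeMap` API (the file and collection names are placeholders; check the MapDB docs for the exact builder methods):

```java
import org.mapdb.*;     // assumes MapDB 3.x on the classpath
import java.io.*;

// Same counting loop as the in-memory approach, but the map is backed by a file,
// so the dictionary is not limited by RAM. File and collection names are placeholders.
public class MapDbWordCount {
    public static void main(String[] args) throws IOException {
        DB db = DBMaker.fileDB("wordcounts.db").make();
        HTreeMap<String, Integer> counts = db
                .hashMap("counts", Serializer.STRING, Serializer.INTEGER)
                .createOrOpen();
        try (BufferedReader in = new BufferedReader(new FileReader("huge.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                for (String word : line.toLowerCase().split("\\W+")) {
                    if (!word.isEmpty()) {
                        counts.merge(word, 1, Integer::sum);   // stored on disk, not in RAM
                    }
                }
            }
        }
        db.close();
        // The entries can then be exported and sorted by count (e.g. with an external sort).
    }
}
```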

pushy
1

If the list of unique words and their frequencies fits in memory (not the whole file, just the unique words), you can use a hash table and read the file sequentially (without storing it).

You can then sort the entries of the hash table by the number of occurrences.
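
A small sketch of that sorting step, assuming the counts already sit in a `HashMap<String, Integer>`:

```java
import java.util.*;

// Turn the hash table of counts into a list of entries sorted by number of occurrences, highest first.
public class SortByOccurrences {
    static List<Map.Entry<String, Integer>> byCount(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return entries;
    }
}
```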

Matteo