
I would like to verify my pseudocode, and would appreciate suggestions for optimizations and better approaches.

mostRepeatingWords(int rank):
(rank here defines how many you want to choose, i.e. the top X most repeated words)

  1. External-sort all files alphabetically, following the algorithm mentioned here.
    Complexity: O(n log n / m)

  2. Create a new file "wordCount" which would contain entries like

    "word1" : 3 - i.e. word1 was repeated three times
    "word2" : 1 - i.e. word2 was unique.

  3.  

    for (each sorted file, read one by one) {
        for (each "word" in the current file) {
            int count = checkIfDuplicateAndReturnCount();
            enter "word" and "count" in the "wordCount" file.

            sanitizeFile();

            if (wordCount file is > page size) {
                write wordCount file to disk.
                create a new wordCount file.
            }

            move to the next word, i.e. currentWord + count;
        }
    }
    

    Complexity: O(n + p), where n is the number of sorted pages and p <= n is the number of "wordCount" pages.

  4. checkIfDuplicateAndReturnCount() is a simple function which compares the current element with the previous one(s) and determines the frequency of the word.

  5. sanitizeFile() is used when the pages are flooded with the same word. Let's say the page size is 4 KB and the number of pages that, after sorting, contain the common word "the" is greater than (4 KB / size of the word). Then we may need to create a new file. sanitizeFile() takes an extra step to combine two entries in the file for the same word. Example:

    • the : 500
    • the : 600

    Would be combined to:

    • the : 1100
  6.  

    for (each wordCount file, read one by one) {
        for (each word in the wordCount file) {
            // use a min-heap / priority queue of size "rank" to keep track of the top frequencies
            if (heap.size() < rank) {
                heap.add(word);
            } else if (count > heap.peek()) {
                heap.poll();   // evict the current minimum first
                heap.add(word);
            }
        }
    }
    

    Complexity is O(p)

  7. Complexity: O(n log n / m) + O(n + p) + O(p), effectively O(n log n / m). A rough sketch of steps 3 and 6 follows below.
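
A minimal Java sketch of steps 3 and 6 combined, assuming a single, globally sorted stream of words (one per line) and skipping the intermediate wordCount files; the class and helper names here are only illustrative, not part of the pseudocode above:

    import java.io.*;
    import java.util.*;

    class TopWordsFromSortedInput {

        record WordCount(String word, long count) {}

        static List<WordCount> topWords(BufferedReader sortedWords, int rank) throws IOException {
            // Min-heap ordered by count, capped at "rank" entries (step 6).
            PriorityQueue<WordCount> heap =
                    new PriorityQueue<>(Comparator.comparingLong(WordCount::count));

            String prev = null;
            long count = 0;
            String line;
            while ((line = sortedWords.readLine()) != null) {
                if (line.equals(prev)) {
                    count++;                        // same run of an identical word (step 4)
                } else {
                    offer(heap, prev, count, rank); // run ended: consider it for the top "rank"
                    prev = line;
                    count = 1;
                }
            }
            offer(heap, prev, count, rank);         // flush the final run

            List<WordCount> result = new ArrayList<>(heap);
            result.sort(Comparator.comparingLong(WordCount::count).reversed());
            return result;
        }

        private static void offer(PriorityQueue<WordCount> heap, String word, long count, int rank) {
            if (word == null) return;
            if (heap.size() < rank) {
                heap.add(new WordCount(word, count));
            } else if (count > heap.peek().count()) {
                heap.poll();                        // evict the current minimum first
                heap.add(new WordCount(word, count));
            }
        }
    }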

  • You might get better responses if you asked a more specific question. I'd note, however, that the initial "external sort all files alphabetically" will inevitably take a long time for large files, being O(n log n). I'd be inclined to use a hash table to count occurrences of all unique words, which can be done with a single run through the data, then partially sort the hash table to find the top 10 or 20 unique words. Even a complete sort of the unique words would take much less effort than sorting all the words in the huge file. – Simon Feb 04 '14 at 23:48
    @Simon The hash table would not be able to fit in memory. Imagine a case where all words are unique except for the first and last. It would have to hold everything in memory until it found the answer. An external sort is a good start/suggestion. – Matt Feb 04 '14 at 23:56
  • Your step 6 is not O(p). If `n` is the total number of unique words and `k` is the number of words you want to select, the worst case behavior of that heap selection algorithm is O(n log k). – Jim Mischel Feb 05 '14 at 04:00

4 Answers


There's no need to sort the files ahead of time. Doing so is just a bunch of unnecessary I/O.

A more straightforward way to do this is:

Create an empty dictionary (hash map), keyed by word. The value is the count.
for each file
    while not end of file
        read word
        if word in dictionary
            update count
        else
            if dictionary full
                sort dictionary by word
                output dictionary to temporary file
                Clear dictionary
            Add word to dictionary, with count 1
    end
end
if dictionary not empty
    sort dictionary by word
    output dictionary to temporary file

You now have some number of temporary files, each sorted by word and containing one word/count pair per line. Like:

aardvark,12
bozo,3
zebra,5
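
A minimal Java sketch of this count-and-spill phase, assuming whitespace-separated words and a fixed entry limit for the in-memory dictionary; maxEntries, tempDir and the class name are illustrative:

    import java.io.*;
    import java.nio.file.*;
    import java.util.*;

    class CountAndSpill {

        static List<Path> countAndSpill(List<Path> inputs, int maxEntries, Path tempDir)
                throws IOException {
            Map<String, Long> counts = new HashMap<>();
            List<Path> tempFiles = new ArrayList<>();

            for (Path input : inputs) {
                try (Scanner in = new Scanner(input.toFile())) {
                    while (in.hasNext()) {
                        String word = in.next();
                        // Dictionary full and this word is new: sort, write out, clear.
                        if (!counts.containsKey(word) && counts.size() >= maxEntries) {
                            tempFiles.add(spill(counts, tempDir));
                        }
                        counts.merge(word, 1L, Long::sum);
                    }
                }
            }
            if (!counts.isEmpty()) {
                tempFiles.add(spill(counts, tempDir));
            }
            return tempFiles;
        }

        // Sort the dictionary by word and write "word,count" lines to a temporary file.
        private static Path spill(Map<String, Long> counts, Path tempDir) throws IOException {
            Path file = Files.createTempFile(tempDir, "wordcount-", ".txt");
            try (BufferedWriter out = Files.newBufferedWriter(file)) {
                List<String> words = new ArrayList<>(counts.keySet());
                Collections.sort(words);
                for (String w : words) {
                    out.write(w + "," + counts.get(w));
                    out.newLine();
                }
            }
            counts.clear();
            return file;
        }
    }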

Create a min-heap that you will use to hold your n largest items. Call it largest_items.

Do a standard n-way merge of those temporary files. As you find each unique item (i.e. you merge all of the "aardvark" entries across the multiple files), you do this:

if (largest_items.count < n)
    largest_items.add(word)
else if (word.count > largest_items.peek().count)
{
    // the count for this word is more than the smallest count
    // already on the heap. So remove the item with the
    // smallest count, and add this one.
    largest_items.remove_root()
    largest_items.add(word)
}
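
A minimal Java sketch of the n-way merge plus selection, assuming the "word,count" temp-file format produced above; the class and helper names are illustrative:

    import java.io.*;
    import java.nio.file.*;
    import java.util.*;

    class MergeAndSelect {

        record Entry(String word, long count, BufferedReader source) {}
        record WordCount(String word, long count) {}

        static List<WordCount> topN(List<Path> tempFiles, int n) throws IOException {
            // Merge heap: orders the current head line of every temp file by word.
            PriorityQueue<Entry> merge = new PriorityQueue<>(Comparator.comparing(Entry::word));
            for (Path f : tempFiles) {
                pushNext(merge, Files.newBufferedReader(f));
            }

            // largest_items: min-heap of the n largest counts seen so far.
            PriorityQueue<WordCount> largest =
                    new PriorityQueue<>(Comparator.comparingLong(WordCount::count));

            String current = null;
            long total = 0;
            while (!merge.isEmpty()) {
                Entry e = merge.poll();
                if (e.word().equals(current)) {
                    total += e.count();             // same word appearing in another temp file
                } else {
                    consider(largest, current, total, n);
                    current = e.word();
                    total = e.count();
                }
                pushNext(merge, e.source());        // advance the file this entry came from
            }
            consider(largest, current, total, n);   // flush the last merged word

            List<WordCount> result = new ArrayList<>(largest);
            result.sort(Comparator.comparingLong(WordCount::count).reversed());
            return result;
        }

        private static void pushNext(PriorityQueue<Entry> merge, BufferedReader r) throws IOException {
            String line = r.readLine();
            if (line == null) { r.close(); return; }
            int comma = line.lastIndexOf(',');
            merge.add(new Entry(line.substring(0, comma),
                                Long.parseLong(line.substring(comma + 1)), r));
        }

        private static void consider(PriorityQueue<WordCount> largest, String word, long count, int n) {
            if (word == null) return;
            if (largest.size() < n) {
                largest.add(new WordCount(word, count));
            } else if (count > largest.peek().count()) {
                largest.poll();                     // remove the smallest count on the heap
                largest.add(new WordCount(word, count));
            }
        }
    }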

Complexity:

  • Building the dictionaries is O(N), where N is the total number of individual words in the file.
  • Sorting each temporary dictionary is O(k log k), where 'k' is the number of words in the dictionary.
  • Writing each temporary dictionary is O(k)
  • The merge is O(M log x), where M is the combined number of entries across all the temporary files, and x is the number of temporary files.
  • Selecting the items is O(m log n), where m is the number of unique words, and n is the number of words you want to select.

If you look at worst case behavior (i.e. all the words are unique), the complexity works out to (n is the total number of words):

  • Building the dictionaries is O(n)
  • Sorting and writing the temporary files is (n/m) * (m log m), where m is the dictionary size.
  • The merge is n log (n/m).
  • Selection is O(m + (k log k)), where k is the number of words you want to select and m is the number of unique words. Because all words are unique they have the same count, so you'll only do k inserts into the heap. The m term dominates when k is much smaller than m (which is usually the case in these situations). So selection turns out to be O(m).

When you're working with data sets larger than memory, very often your bottleneck is file I/O. The algorithm I've outlined above tries to minimize the I/O. In the worst case (all words are unique), each word will be read twice and written once. But in the general case each word is read once and then each hash page is written once and read once. Plus, your sorts are on hash pages rather than raw words, and the total size of your temporary files will be much smaller than the original text.

Jim Mischel

One practical solution, which has the benefit of being trivial to parallelise (a sketch of the first two steps follows the list):

  • Use a hashing algorithm to assign each word to one of k buckets, where k is large enough that the largest bucket will "probably" fit in memory. (If it turns out that k is too small, you can try again with a larger k. Another solution is to just rehash the bucket which doesn't fit into two pieces using a different hash function. But if you use a good hash function and the data is not pathological, this will never be necessary.)

  • Process the data k times, each time working with only one bucket. Using a different hash function, insert all of the data in one bucket into a hash table, and then select the n most common entries; save these externally.

  • Finally, combine the saved groups of n most common entries from each bucket to select the overall n most common words.
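
A minimal Java sketch of the first two steps, assuming each bucket fits in memory. Unlike the description above, this sketch writes the buckets out to temporary files rather than re-reading the input k times, and it simply sorts each bucket's hash table by count instead of using a second hash function; all names are illustrative:

    import java.io.*;
    import java.nio.file.*;
    import java.util.*;

    class BucketedTopWords {

        // Step 1: assign each word to one of k bucket files by hash.
        static List<Path> partition(Path input, int k, Path tempDir) throws IOException {
            List<BufferedWriter> writers = new ArrayList<>();
            List<Path> buckets = new ArrayList<>();
            for (int i = 0; i < k; i++) {
                Path p = tempDir.resolve("bucket-" + i + ".txt");
                buckets.add(p);
                writers.add(Files.newBufferedWriter(p));
            }
            try (Scanner in = new Scanner(input.toFile())) {
                while (in.hasNext()) {
                    String word = in.next();
                    BufferedWriter w = writers.get(Math.floorMod(word.hashCode(), k));
                    w.write(word);
                    w.newLine();
                }
            }
            for (BufferedWriter w : writers) w.close();
            return buckets;
        }

        // Step 2: count one bucket in memory and keep its n most common words.
        static List<Map.Entry<String, Long>> topNInBucket(Path bucket, int n) throws IOException {
            Map<String, Long> counts = new HashMap<>();
            for (String word : Files.readAllLines(bucket)) {
                counts.merge(word, 1L, Long::sum);
            }
            return counts.entrySet().stream()
                    .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                    .limit(n)
                    .toList();
        }
    }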

rici

Read the file and create a new file for each word; each such file will contain the count of occurrences of that word.

Also keep two variables: one will store the word and the other will store its occurrence count.

  1. Read the file line by line.

  2. For each word, check whether its file is present. If it is, read the count from the file; if it is not, create a new file.

  3. Increment the count and write it back to the file.

  4. Before moving to the next word, check whether the count of the current word is greater than the max-count variable we have kept; if it is, update both variables.

Repeat the process until the complete file has been read.

The complexity will be O(m * n), where m is the number of lines in the file and n is the number of words. Correct me if I am wrong in calculating the time complexity.
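
A minimal Java sketch of this per-word-file approach, assuming every word is a legal file name; countsDir and the class name are illustrative:

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.Scanner;

    class PerWordFileCounter {

        static String mostFrequentWord(Path input, Path countsDir) throws IOException {
            String maxWord = null;   // variable holding the current most frequent word
            long maxCount = 0;       // variable holding its occurrence count
            try (Scanner in = new Scanner(input.toFile())) {
                while (in.hasNext()) {
                    String word = in.next();
                    Path f = countsDir.resolve(word);
                    // If the word's file exists, read its count; otherwise start at 1.
                    long count = Files.exists(f)
                            ? Long.parseLong(Files.readString(f).trim()) + 1
                            : 1;
                    Files.writeString(f, Long.toString(count));  // write the count back
                    if (count > maxCount) {                      // update the running maximum
                        maxCount = count;
                        maxWord = word;
                    }
                }
            }
            return maxWord;
        }
    }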


You may simply replace your dictionary implementation with a disk-based one, such as

Note that you can use OS filesystems for this.

If you are really interested in the algorithmic complexity of your task, please bear in mind that you cannot simply use big-O notation when you count both CPU steps and disk I/O. There is a 1,000,000-fold difference in their latencies.

You may find interesting results by googling "cache-oblivious". See

nodakai