I would like to verify my pseudocode, and I would appreciate suggested optimizations and better approaches.
mostRepeatingWords(int rank):
(rank defines how many entries to return, i.e. the top X most frequently repeated words)
External sort all the files alphabetically, following the algorithm mentioned here (a sketch is below).
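A minimal sketch of the external sort phase in Java, assuming each input file holds one word per line and that MEMORY_WORDS words fit in RAM at once; MEMORY_WORDS, createSortedRuns, writeRun, and mergeRuns are illustrative names of my own, not from the linked algorithm:

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class ExternalSort {
    static final int MEMORY_WORDS = 1_000_000; // assumed in-memory capacity

    // Phase 1: split the inputs into sorted runs that each fit in memory.
    static List<Path> createSortedRuns(List<Path> inputs) throws IOException {
        List<Path> runs = new ArrayList<>();
        List<String> buffer = new ArrayList<>();
        for (Path input : inputs) {
            try (BufferedReader in = Files.newBufferedReader(input)) {
                String word;
                while ((word = in.readLine()) != null) {
                    buffer.add(word);
                    if (buffer.size() == MEMORY_WORDS) {
                        runs.add(writeRun(buffer));
                        buffer = new ArrayList<>();
                    }
                }
            }
        }
        if (!buffer.isEmpty()) runs.add(writeRun(buffer));
        return runs;
    }

    static Path writeRun(List<String> buffer) throws IOException {
        Collections.sort(buffer);
        Path run = Files.createTempFile("run", ".txt");
        Files.write(run, buffer);
        return run;
    }

    // Phase 2: k-way merge of all sorted runs into one sorted output file.
    // (Run readers are left unclosed here for brevity.)
    static void mergeRuns(List<Path> runs, Path output) throws IOException {
        // Heap entry: (current word, the reader it came from).
        PriorityQueue<Map.Entry<String, BufferedReader>> heap =
                new PriorityQueue<>(Map.Entry.comparingByKey());
        for (Path run : runs) {
            BufferedReader r = Files.newBufferedReader(run);
            String w = r.readLine();
            if (w != null) heap.add(Map.entry(w, r));
        }
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            while (!heap.isEmpty()) {
                Map.Entry<String, BufferedReader> e = heap.poll();
                out.write(e.getKey());
                out.newLine();
                String next = e.getValue().readLine();
                if (next != null) heap.add(Map.entry(next, e.getValue()));
            }
        }
    }
}
```

Phase 1 produces roughly n/m sorted runs, and phase 2 merges them all at once with a heap keyed on the smallest unread word of each run.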
Complexity: Θ(n log n / m)

Create a new file "wordCount" which would contain entries like:
"word1" : 3
- i.e. word1 was repeated three times
"word2" : 1
- i.e. word2 was unique

```
for (read each of the sorted files, one by one) {
    for (read each "word" in the current file) {
        int count = checkIfDuplicateAndReturnCount();
        enter "word" and "count" in the "wordCount" file;
        sanitizeFile();
        if (the wordCount file is > page size) {
            write the wordCount file to disk;
            create a new wordCount file;
        }
        move to the next word, i.e. currentWord + count;
    }
}
```
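Because step 1 leaves equal words adjacent, this counting loop can be a single streaming pass that run-length-counts the sorted stream. A minimal sketch in Java, assuming the sorted output holds one word per line and the wordCount format is "word : count" (the names buildWordCount and flush are my own, for illustration):

```java
import java.io.*;
import java.nio.file.*;

public class WordCountBuilder {
    // One streaming pass over the sorted words: equal words are adjacent,
    // so a word's full count is known the moment the word changes.
    static void buildWordCount(Path sortedWords, Path wordCount) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(sortedWords);
             BufferedWriter out = Files.newBufferedWriter(wordCount)) {
            String current = null;
            long count = 0;
            String word;
            while ((word = in.readLine()) != null) {
                if (word.equals(current)) {
                    count++;                    // same run continues
                } else {
                    flush(out, current, count); // run ended: emit "word : count"
                    current = word;
                    count = 1;
                }
            }
            flush(out, current, count);         // emit the final run
        }
    }

    static void flush(BufferedWriter out, String word, long count) throws IOException {
        if (word == null) return;               // nothing buffered yet
        out.write(word + " : " + count);
        out.newLine();
    }
}
```

Because a word's count is finalized the moment the word changes, each word is written exactly once; checkIfDuplicateAndReturnCount() collapses into a single equals() comparison, and sanitizeFile() becomes unnecessary as long as the running count carries across page boundaries like this.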
Complexity: O(n + p), where n is the number of sorted pages and p (<= n) is the number of "wordCount" pages.
checkIfDuplicateAndReturnCount() is a simple function which compares the current element with the previous one, and so on, to determine the frequency of the word.

sanitizeFile()
is used when pages are flooded with the same word. Let's say the page size is 4KB, and the number of entries, after sorting, for a common word like "the" exceeds (4KB / size of the word). Then we may need to create a new file, but sanitizeFile() takes an extra step to combine the entries for the same word. Example:

- the : 500
- the : 600

would be combined to:

- the : 1100
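If per-page wordCount files are kept as in the pseudocode, sanitizeFile() can be a single pass that folds adjacent entries for the same word. A sketch under the same "word : count" format assumption (this is my reconstruction of the step described above, not a known implementation):

```java
import java.io.*;
import java.nio.file.*;

public class Sanitizer {
    // Merge adjacent entries for the same word: "the : 500" followed by
    // "the : 600" becomes "the : 1100". Assumes entries for equal words
    // are adjacent, which holds because the input was sorted.
    static void sanitizeFile(Path dirty, Path clean) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(dirty);
             BufferedWriter out = Files.newBufferedWriter(clean)) {
            String currentWord = null;
            long total = 0;
            String line;
            while ((line = in.readLine()) != null) {
                int sep = line.lastIndexOf(" : ");
                String word = line.substring(0, sep);
                long count = Long.parseLong(line.substring(sep + 3));
                if (word.equals(currentWord)) {
                    total += count;             // fold into the running entry
                } else {
                    if (currentWord != null) {
                        out.write(currentWord + " : " + total);
                        out.newLine();
                    }
                    currentWord = word;
                    total = count;
                }
            }
            if (currentWord != null) {          // write the last entry
                out.write(currentWord + " : " + total);
                out.newLine();
            }
        }
    }
}
```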
```
for (read each of the "wordCount" files, one by one) {
    for (each word and its count in the current file) {
        // use a min-heap / priority queue of size "rank" to track the top frequencies
        if (heap.size() < rank) {
            heap.add(word, count);
        } else if (count > heap.peek().count) {
            heap.poll();    // evict the current minimum
            heap.add(word, count);
        }
    }
}
```
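As runnable Java, the selection phase could look like the sketch below (mostRepeatingWords mirrors the function named at the top; the Entry record and the "word : count" parsing are my own assumptions). The key correction from the original pseudocode is that once the heap is full you evict the minimum with poll() before inserting, so the heap always holds the rank largest counts seen so far:

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class TopK {
    record Entry(String word, long count) {}

    // Keep the "rank" largest counts in a min-heap: the root is always the
    // smallest of the current candidates, so any bigger count replaces it.
    static List<Entry> mostRepeatingWords(List<Path> wordCountFiles, int rank)
            throws IOException {
        if (rank <= 0) return List.of();
        PriorityQueue<Entry> heap =
                new PriorityQueue<>(Comparator.comparingLong(Entry::count));
        for (Path file : wordCountFiles) {
            try (BufferedReader in = Files.newBufferedReader(file)) {
                String line;
                while ((line = in.readLine()) != null) {
                    int sep = line.lastIndexOf(" : ");
                    Entry e = new Entry(line.substring(0, sep),
                                        Long.parseLong(line.substring(sep + 3)));
                    if (heap.size() < rank) {
                        heap.add(e);                    // heap not full yet
                    } else if (e.count() > heap.peek().count()) {
                        heap.poll();                    // evict current minimum
                        heap.add(e);
                    }
                }
            }
        }
        List<Entry> top = new ArrayList<>(heap);
        top.sort(Comparator.comparingLong(Entry::count).reversed());
        return top;                                     // highest count first
    }
}
```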
Complexity: O(p) page reads; the heap operations add O(log rank) in-memory work per entry.
Overall complexity: Θ(n log n / m) + O(n + p) + O(p), which is effectively Θ(n log n / m).