
I have a very big text file, with dozens of millions of words, one word per line. I need to find the top 10 most common words in that file. There are some restrictions: only the standard library may be used, and memory usage must stay below 1 KB.

It is guaranteed that any 10 words in that file are short enough to fit within that memory limit, and that there will be enough memory left for a few other variables such as counters.

The only solution I have come up with is to use another text file as additional memory and a buffer. But that seems like a bad and slow way to deal with the problem.

Are there any better, more efficient solutions?

Darth-CodeX
Sergey U
    If you're willing to accept extremely long run times, there is a very trivial and simple O(n^2) implementation. – abelenky May 14 '22 at 14:20
  • Your stack will consume at least 4 KiB of RAM, so even if you have no code at all you'll never be able to meet the "less than 1 KiB of memory usage" requirement. – Brendan May 16 '22 at 08:03

2 Answers

2

You can first sort this file (it is possible with limited memory, but will require disk I/O of course - see How do I sort very large files as a starting point).

Then you will be able to read the sorted file line by line and compute the frequency of each word as you go, since identical words are now adjacent. Keep an internal array of 10 entries; once it is full, whenever a word's frequency is higher than the smallest frequency stored, replace that least frequent entry with the new word. This way you only ever keep the 10 most frequent words seen so far in memory during this stage.
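
As a rough illustration, here is a minimal sketch of that second stage in C. It assumes the already-sorted file is named `sorted.txt` and that no word exceeds 31 characters - both are assumptions, not givens from the question:

```c
#include <stdio.h>
#include <string.h>

#define TOP 10
#define MAXWORD 32   /* assumed maximum word length; adjust to your data */

struct entry { char word[MAXWORD]; long count; };

int main(void)
{
    FILE *f = fopen("sorted.txt", "r");   /* hypothetical name of the sorted file */
    if (!f) return 1;

    struct entry top[TOP] = {0};          /* the 10 most frequent words seen so far */
    char prev[MAXWORD] = "", cur[MAXWORD];
    long run = 0;

    for (;;) {
        int more = fscanf(f, "%31s", cur) == 1;
        if (more && strcmp(cur, prev) == 0) { run++; continue; }

        /* a run of identical words just ended: try to place it in the top array */
        if (run > 0) {
            int min = 0;
            for (int i = 1; i < TOP; i++)
                if (top[i].count < top[min].count) min = i;
            if (run > top[min].count) {
                strcpy(top[min].word, prev);
                top[min].count = run;
            }
        }
        if (!more) break;
        strcpy(prev, cur);
        run = 1;
    }
    fclose(f);

    for (int i = 0; i < TOP; i++)
        if (top[i].count > 0)
            printf("%s %ld\n", top[i].word, top[i].count);
    return 0;
}
```

The `top` array holds at most 10 short words and their counts, plus two small line buffers, so this stage stays well under the 1 KB budget.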

As @John Bollinger mentioned, if your requirement is to print all top-10 words - for example, when every word in the file has the same frequency, so all of them are "top" - then this approach will not work. In that case you need to compute the frequency of each word, store the (word, frequency) pairs in a file, sort that file, and then print the top 10 including every word with the same frequency as the 10th one.
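
A sketch of the final printing step for that variant, assuming a hypothetical intermediate file `freq_sorted.txt` whose lines are `count word`, already sorted by count in descending order:

```c
#include <stdio.h>

/* Prints the top 10 entries of a "count word" file already sorted by
 * count in descending order, plus any further words tied with the 10th. */
int main(void)
{
    FILE *f = fopen("freq_sorted.txt", "r");   /* hypothetical frequency file */
    if (!f) return 1;

    char word[32];
    long count, tenth = -1;
    int printed = 0;

    while (fscanf(f, "%ld %31s", &count, word) == 2) {
        if (printed < 10) {
            printf("%s %ld\n", word, count);
            if (++printed == 10) tenth = count;
        } else if (count == tenth) {
            printf("%s %ld\n", word, count);   /* word tied with the 10th place */
        } else {
            break;                             /* counts only decrease from here */
        }
    }
    fclose(f);
    return 0;
}
```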

Iłya Bursov
  • Potential issue with this approach: suppose there are many words tied for tenth most frequent. If the requirement in that case is to print all of the words tied for that frequency, then it might well not be possible to hold them all in memory at once. There are workarounds, of course. – John Bollinger May 14 '22 at 14:44
  • @JohnBollinger not really, check problem description `It is guaranteed that any 10 words in that file is short enough to fit into said memory limit ` – Iłya Bursov May 14 '22 at 14:46
  • That any 10 of the words can fit in memory at once does not mitigate the issue. It's not about long words, but rather about needing to hold more than ten in memory at the same time. – John Bollinger May 14 '22 at 14:49
  • @JohnBollinger no, to print the `top 10 most common words` you don't need to hold all of the most common ones; if, for example, all words in the file have the same frequency, you still just print 10 - in my approach it will be the first 10 alphabetically – Iłya Bursov May 14 '22 at 14:55
  • Well *that's* a question of the interpretation of the problem. As I wrote, "If the requirement in that case is to print all of the words tied for that frequency" then that is a problem for this suggestion. It is by no means clear to me which is the required behavior when "top 10 most common words" does not characterize a unique set of exactly 10 words. – John Bollinger May 14 '22 at 15:06
  • @JohnBollinger got it, updated my answer to cover this case as well – Iłya Bursov May 14 '22 at 15:10
0

If you can create a new file, however big, you can build a simple disk-based tree database holding each word and its frequency so far. Each update will cost you O(log n), with n growing from 1 to N words, plus the final scan of the whole N-sized tree, which adds up to O(N log N).
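
A rough sketch of such a disk-based tree in C, using a plain (unbalanced) binary search tree of fixed-size records; the file names and the word-length limit are assumptions, and the O(log n) per update only holds on average here - a balanced on-disk structure (e.g. a B-tree) would be needed to guarantee it. The final traversal that extracts the top 10 is omitted:

```c
#include <stdio.h>
#include <string.h>

#define MAXWORD 32

/* One fixed-size record per distinct word, stored in the index file. */
struct node {
    char word[MAXWORD];
    long count;
    long left;    /* file offset of left child, or -1 */
    long right;   /* file offset of right child, or -1 */
};

/* Insert a word or increment its count; returns 0 on success. */
static int add_word(FILE *idx, const char *w)
{
    struct node n;
    long off = 0;

    fseek(idx, 0, SEEK_END);
    if (ftell(idx) == 0) {                    /* empty index: write the root */
        strcpy(n.word, w); n.count = 1; n.left = n.right = -1;
        return fwrite(&n, sizeof n, 1, idx) == 1 ? 0 : -1;
    }
    for (;;) {
        fseek(idx, off, SEEK_SET);
        if (fread(&n, sizeof n, 1, idx) != 1) return -1;
        int cmp = strcmp(w, n.word);
        if (cmp == 0) {                       /* word already present: bump count */
            n.count++;
            fseek(idx, off, SEEK_SET);
            return fwrite(&n, sizeof n, 1, idx) == 1 ? 0 : -1;
        }
        long *child = cmp < 0 ? &n.left : &n.right;
        if (*child != -1) { off = *child; continue; }

        /* append a new leaf and link the parent to it */
        fseek(idx, 0, SEEK_END);
        long leaf = ftell(idx);
        struct node fresh;
        strcpy(fresh.word, w); fresh.count = 1; fresh.left = fresh.right = -1;
        if (fwrite(&fresh, sizeof fresh, 1, idx) != 1) return -1;
        *child = leaf;
        fseek(idx, off, SEEK_SET);
        return fwrite(&n, sizeof n, 1, idx) == 1 ? 0 : -1;
    }
}

int main(void)
{
    FILE *in  = fopen("words.txt", "r");      /* hypothetical input file */
    FILE *idx = fopen("index.bin", "w+b");    /* hypothetical tree/index file */
    if (!in || !idx) return 1;

    char w[MAXWORD];
    while (fscanf(in, "%31s", w) == 1)
        if (add_word(idx, w) != 0) return 1;

    /* a traversal of index.bin (omitted) then yields each word's frequency */
    fclose(in);
    fclose(idx);
    return 0;
}
```

Only one node record is held in memory at a time, so the counting phase comfortably fits the memory limit.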

If you cannot create a new file, you'll need to perform an in-place sort of the whole file, which will cost about O(N²). That's closer to O((N/k)²), I think, with k the average number of words you can keep in memory for the simplest bubble sort - but that is O(1/k²)·O(N²) = K·O(N²), which is still O(N²). At that point you can rescan the file one final time, and after each run of each word you'll know whether that word can enter your top ten, and at which position. So you need to fit just twelve words in memory (the top ten, the current word, and the word just read from the file). 1 KB should be enough.

So, the auxiliary file is actually the fastest option.

LSerni