I'm currently doing an exercise on file handling where I need to read a specific line from a huge text file, given its line number (the file is multi-gigabyte and contains only ASCII characters).
What I have done so far:
Since the lines are not of equal length, I processed the file once to build a HashMap that maps each line number to its byte offset (the cold-start pass). The offset is then used to seek into the given text file via a random access file. The problem with this approach is that for really huge files the memory cannot hold the HashMap, and on top of that the I/O becomes the bottleneck. My system specs are: 8 GB DDR2 RAM, a 1 TB SATA2 7200 RPM disk, and 4 64-bit cores. Let's assume the entire hardware is at my disposal.
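For reference, here is a minimal sketch of what I mean by the cold-start index (class and method names like LineOffsetIndex and buildIndex are just placeholders, and error handling is stripped down):

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

public class LineOffsetIndex {

    // Line number -> byte offset of the first character of that line.
    // For a multi-gigabyte file this map alone can exhaust the heap.
    private final Map<Long, Long> offsets = new HashMap<>();
    private final String path;

    public LineOffsetIndex(String path) {
        this.path = path;
    }

    // Cold-start pass: scan the whole file once and record where every line begins.
    public void buildIndex() throws IOException {
        try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(path))) {
            long lineNumber = 0;
            long offset = 0;
            offsets.put(lineNumber, offset);            // line 0 starts at offset 0
            int b;
            while ((b = in.read()) != -1) {
                offset++;
                if (b == '\n') {
                    offsets.put(++lineNumber, offset);  // next line starts right after '\n'
                }
            }
        }
    }

    // Random access: seek straight to the recorded offset and read one line.
    public String readLine(long lineNumber) throws IOException {
        Long offset = offsets.get(lineNumber);
        if (offset == null) {
            return null;  // no such line
        }
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(offset);
            return raf.readLine();  // safe here because the file is ASCII-only
        }
    }
}
```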
What I intend to do
To minimize latency, I intend to index the lines of the text file into pages and use a Least Recently Used (LRU) replacement policy to page the required page into memory whenever the page containing the requested line number is not already resident (just like a cache); a sketch of what I have in mind follows below. I understand that a lot depends on the page size and on the number of pages kept resident in memory, but I really need to know whether I'm heading in a meaningful direction or whether this is just overkill.
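This is roughly the page cache I'm picturing, built on LinkedHashMap's access-order mode (LruPageCache and PageLoader are hypothetical names; how a page is actually read from disk is a separate decision):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LruPageCache<P> {

    private final Map<Long, P> pages;

    public LruPageCache(final int maxResidentPages) {
        // accessOrder = true makes the LinkedHashMap keep entries in
        // least-recently-used order, and removeEldestEntry() is consulted
        // after every insertion.
        this.pages = new LinkedHashMap<Long, P>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, P> eldest) {
                return size() > maxResidentPages;  // evict the LRU page when full
            }
        };
    }

    // Return the requested page, loading it from disk on a miss.
    public P get(long pageNumber, PageLoader<P> loader) {
        P page = pages.get(pageNumber);       // a hit marks the page as recently used
        if (page == null) {
            page = loader.load(pageNumber);   // miss: page the data in from disk
            pages.put(pageNumber, page);      // may trigger eviction of the LRU page
        }
        return page;
    }

    // Hypothetical hook: a page could be a block of line offsets or the raw
    // bytes of a region of the file, depending on the chosen page layout.
    public interface PageLoader<P> {
        P load(long pageNumber);
    }
}
```

The idea is that only the hot pages (say, a few thousand line offsets each) stay in memory, and eviction is O(1) because LinkedHashMap already maintains the recency ordering internally.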
Thanks for all your help and suggestions.