
I'm currently doing an exercise on file handling where I need to read a specific line in a huge text file given the line number (multi-gigabyte file containing ASCII characters only).

What I have done so far:

Since the file does not have lines of equal size, I processed the file to create a Hashmap of offsets of each line keyed by the line number (a cold-start pass). The offset is then used to seek into the random-access file (the given text file). The problem with this approach is that for really huge files the memory falls short of accommodating the Hashmap, and on top of that the I/O becomes the bottleneck. My system specs: 8 GB DDR2 RAM, a 1 TB (SATA2) 7200 RPM disk, and four 64-bit cores. Let's assume that the entire hardware is at my disposal.
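Roughly, the cold-start pass amounts to the following sketch (the file path and method name are placeholders); it records the byte offset at which every line starts, so that a later seek() can jump straight to any line:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

// Cold-start pass: map every line number to the byte offset where that line starts.
static Map<Integer, Long> indexLines(String path) throws IOException {
    Map<Integer, Long> lineOffsets = new HashMap<>();
    try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
        int lineNumber = 0;
        long offset = raf.getFilePointer();   // 0, the start of line 0
        while (raf.readLine() != null) {      // ASCII only, so readLine() is safe here
            lineOffsets.put(lineNumber++, offset);
            offset = raf.getFilePointer();    // start of the next line
        }
    }
    return lineOffsets;
}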

What I intend to do

In order to minimize the latency, I intend to index all the lines of the text file into pages and use a Least Recently Used (LRU) page-replacement policy to page the necessary page into memory, depending on whether the page containing the requested line number is already resident or not (just like a cache); a rough sketch follows below. I understand that a lot depends on the page size and on the number of resident pages, but I really need to know whether I'm heading in a meaningful direction or whether it is just going to be overkill.
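For concreteness, the page cache I have in mind would be something along the lines of the sketch below, built on LinkedHashMap's access-order mode (the page type and capacity are placeholders); a "page" would hold the offsets of a contiguous run of lines:

import java.util.LinkedHashMap;
import java.util.Map;

// LRU cache of pages: once the cache is full, the least recently used page is evicted.
static <K, V> Map<K, V> lruPageCache(int maxResidentPages) {
    return new LinkedHashMap<K, V>(16, 0.75f, true) {   // true = access order, i.e. LRU
        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > maxResidentPages;
        }
    };
}

The key would be something like lineNumber / linesPerPage and the value the array of offsets for that page.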

Thanks for all your help and suggestions.

shark1608
  • Do you have to access the file by line number or what criteria? – Smutje Aug 20 '14 at 08:55
  • Yes, the file is accessed by the line number. Thanks for the clarification, I've edited the post – shark1608 Aug 20 '14 at 09:01
  • Before you think about a caching strategy: Is it common that subsequent reads access the same line numbers more than once? – Smutje Aug 20 '14 at 09:10
  • There is no such assumption. My idea was driven by the fact that even if the million or so lines held in memory as pages yield only a small percentage of cache hits, the latency due to I/O can still be reduced. – shark1608 Aug 20 '14 at 10:12

1 Answer


Hashmap of offsets of each line

And that's the problem. The map may easily eat more memory than the file itself.

Storing the offset of only every 1024th line (for example) needs 1024 times less memory. Reading a single line or a thousand lines from disk takes about the same time, because:

  • the smallest accessible unit is a block (it used to be 512 B, but is typically 4 KiB now)
  • the transfer time for a block is negligible when compared to the seek time

So round the requested line number down to a multiple of 1024, seek to the stored offset of that line, and then read sequentially up to the line you really wanted.
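A sketch of building such a sparse index (the readLine()-based scan and the names are just one way to do it; the question says the file is pure ASCII, so readLine() is fine):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

// One pass over the file, remembering the offset of every 1024th line only.
static List<Long> buildSparseIndex(String path) throws IOException {
    List<Long> offsets = new ArrayList<>();
    try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
        int line = 0;
        long offset = raf.getFilePointer();   // start of line 0
        while (raf.readLine() != null) {
            if ((line & 1023) == 0) {         // line number is a multiple of 1024
                offsets.add(offset);
            }
            line++;
            offset = raf.getFilePointer();    // start of the next line
        }
    }
    return offsets;
}

Line 5000, for example, is then served from offsets.get(5000 >> 10), followed by skipping 5000 & 1023 lines.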


You're using a HashMap<Integer, Integer>, which wastes a lot of memory on the boxed ints. But your keys form a contiguous sequence, so why not use a List<Integer> or even an int[]? That can reduce the memory consumption by a factor of maybe 4.

I'd go for List<int[]> offsets accessed as follows:

int getOffset(int line) {
    int listIndex = line >> 20;           // one list entry per 1M lines
    int arrayIndex = (line >> 10) & 1023; // one array entry per 1k lines
    int remaining = line & 1023;          // lines the caller still has to skip

    // TODO: handling of index bounds
    int offset = offsets.get(listIndex)[arrayIndex];
    // seek to offset, then skip `remaining` lines to reach the requested one
    return offset;
}
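The two placeholder steps at the end could look roughly like this (RandomAccessFile and the helper name are just one possible choice):

// Sketch: jump to the nearest indexed line and read forward to the requested one.
String lineAt(RandomAccessFile raf, int line) throws IOException {
    raf.seek(getOffset(line));        // offset of the last indexed line at or before `line`
    int remaining = line & 1023;      // lines between that indexed line and the target
    while (remaining-- > 0) {
        raf.readLine();               // skip; these reads mostly hit the same disk block(s)
    }
    return raf.readLine();            // the line you actually wanted
}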

You'll need 4 bytes (an int) for every 1024 lines, plus a little extra per million lines for the array objects. Unless your text file is multi-terabyte, you need no LRU, as everything fits nicely in memory.

For really huge files (offsets beyond 2 GiB), replace int with long in the code above.

maaartinus
  • +1 for the elegant solution. An extended explanation of the memory overhead of HashMap compared to ArrayList can be found here: http://stackoverflow.com/questions/1526596/memory-overhead-of-java-hashmap-compared-to-arraylist – shark1608 Aug 20 '14 at 09:58