
Here is a simple example to illustrate my problem: I have a large binary file with 10 million values.

I want to get 5K values from certain points in this file.

I have a list of indexes giving me the exact position of each value in the file.

To solve this I tried two methods:

  1. Going through the indexes and simply using seek() (from the start of the file) to get each value, something like this:

    wanted_line_list = []
    binaryFile_new = open(binary_folder_path, "r+b")
    for index in index_list:
        binaryFile_new.seek(size * index, 0)
        wanted_line = binaryFile_new.read(size)
        wanted_line_list.append(wanted_line)
    binaryFile_new.close()
    

    But as I understand it, this solution reads through the file from the beginning for each index, therefore the complexity is O(N**2) in terms of file size.

  2. Sorting the indexes so I could go through the file "once" while seeking from the current position, with something like this:

    wanted_line_list = []
    binaryFile_new = open(binary_folder_path, "r+b")
    sorted_index_list = sorted(index_list)
    for i, index in enumerate(sorted_index_list):
        if i == 0:
            binaryFile_new.seek(size * index, 0)
        else:
            # relative seek from the current position, minus the `size` bytes already read
            binaryFile_new.seek((index - sorted_index_list[i - 1]) * size - size, 1)
        wanted_line = binaryFile_new.read(size)
        wanted_line_list.append(wanted_line)
    binaryFile_new.close()
    

    I expected the second solution to be much faster because in theory it would go through the whole file only once, O(N).

    But for some reason both solutions run at about the same speed.

I also have a hard constraint on memory usage, as I run this operation in parallel and on many files, so I can't read files into memory.

Maybe the mmap module will help? Though I think mmap also scans the entire file until it gets to the index, so it's not "true" random access.

  • Possible duplicate of [Reading binary file and looping over each byte](https://stackoverflow.com/questions/1035340/reading-binary-file-and-looping-over-each-byte) – TemporalWolf Aug 08 '19 at 18:08
  • If you chunk it (as in the dupe target) then you only load 1 chunk at a time. Pick a chunk size that makes sense (and potentially add logic for recovering data which spans multiple chunks) and you'll never come close to hitting your RAM limit. – TemporalWolf Aug 08 '19 at 18:09
  • I think the `mmap` module would help because its implementation will use the OS's memory paging (which is likely very efficient) to effectively do any "seeking" needed. – martineau Aug 08 '19 at 18:14

1 Answer


I'd go with #1:

    for index in index_list:
        binary_file.seek(size * index)
        # ...

(I cleaned up your code a bit to comply with Python naming conventions and to avoid the magic 0 constant, as SEEK_SET is the default anyway.)

as I understand this solution reads through from the beginning for each index, therefore the complexity is O(N**2) in terms of file size.

No, a seek() does not "read through from the beginning"; that would defeat the point of seeking. Seeking to the beginning of the file and seeking to the end of the file have roughly the same cost.

Sorting the indexes so I could go through the file "once" while seeking from the current position

I can't quickly find a reference for this, but I believe there's absolutely no point in calculating the relative offset in order to use SEEK_CUR instead of SEEK_SET.

There might be a small improvement just from seeking to the positions you need in order instead of randomly: there's an increased chance your reads will be serviced from the cache if many of the points you need happen to be close to each other (so that your read pattern triggers read-ahead in the file system).
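Putting that together with approach #1, a minimal sketch (reusing the question's `size`, `index_list` and `binary_folder_path`; absolute seeks, no offset arithmetic, just visiting the indexes in ascending order) might look like this:

    wanted_values = []
    with open(binary_folder_path, "rb") as binary_file:
        for index in sorted(index_list):       # visit positions in ascending order
            binary_file.seek(size * index)     # absolute seek; SEEK_SET is the default
            wanted_values.append(binary_file.read(size))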

Maybe the mmap package will help? Though, I think mmap also scans the entire file until it gets to the index so it's not "true" random access.

mmap doesn't scan the file. It sets up a region in your program's virtual memory to correspond to the file, so that accessing any page from this region the first time leads to a page fault, during which the OS reads that page (several KB) from the file (assuming it's not in the page cache) before letting your program proceed.
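For illustration only, here is a minimal sketch of random access through the mmap module, reusing the question's `size`, `index_list` and `binary_folder_path` (the `wanted_values` name is just chosen here):

    import mmap

    with open(binary_folder_path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # slicing the map only faults in the pages that back the requested bytes
            wanted_values = [mm[index * size : index * size + size] for index in index_list]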

The internet is full of discussions of the relative merits of read vs. mmap, but I recommend you don't bother trying to optimize with mmap, and use that time to learn about virtual memory and the page cache instead.

[edit] Reading in chunks larger than the size of your values might save you a bit of CPU time if many of the values you need happen to fall in the same chunk (which is not a given), but unless your program is CPU-bound in production, I wouldn't bother with that either.
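As a rough sketch only (the chunk size below is a hypothetical tuning knob, and the variable names are not from the question), grouping sorted indexes by chunk could look roughly like this:

    CHUNK_VALUES = 8192                        # hypothetical: how many values to read per chunk
    chunk_bytes = CHUNK_VALUES * size          # a whole number of values, so none straddles a chunk boundary

    wanted_values = []
    with open(binary_folder_path, "rb") as f:
        current_start = None
        chunk = b""
        for index in sorted(index_list):
            offset = index * size
            chunk_start = (offset // chunk_bytes) * chunk_bytes
            if chunk_start != current_start:   # only hit the file when we move to a new chunk
                f.seek(chunk_start)
                chunk = f.read(chunk_bytes)
                current_start = chunk_start
            rel = offset - chunk_start
            wanted_values.append(chunk[rel:rel + size])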

  • Hi @Nickolay, thanks for the detailed answer. My limited understanding comes from my previous question https://stackoverflow.com/questions/56629602/python-mmap-slow-access-to-end-of-files-with-test-code/56630370?noredirect=1#comment99833276_56630370 and from some emails with a few community members; I couldn't find any easily accessible details about how seek works without diving into the implementation, so I just assumed that's why it slows down with larger files. – artembus Aug 09 '19 at 07:11
  • Regarding the problem itself - I suspect the actual problem is with the cache and the loading of pages. When I run my script it loads a bunch of values from several files, then it just hangs for a few seconds, and this cycle repeats. It feels like some memory fills up and then Python/the kernel needs some time to empty the buffer in order to read/load new pages. I hoped loading values in order along the file would help, but apparently it doesn't. In my current state all the values are dispersed throughout the file; if I can't find a solution for that I'll need to rewrite the data to group values together. – artembus Aug 09 '19 at 07:18
  • Since the "problem itself" was described in another question, I posted an answer there as well. There shouldn't be hanging, but it seems that you're expecting your disk to work as fast as your RAM. – Nickolay Aug 09 '19 at 11:01