Here is a simple example to illustrate my problem: I have a large binary file with 10 million values.
I want to get 5K values from certain points in this file.
I have a list of indexes giving me the exact place in the file I have my value in.
To solve this I tried two methods:
Going through the values and simply using
seek()
(from the start of the file) to get each value, something like this:binaryFile_new = open(binary_folder_path, "r+b") for index in index_list: binaryFile_new.seek (size * (index), 0) wanted_line = binaryFile_new.read (size) wanted_line_list.append(wanted_line) binaryFile_new.close()
But as I understand this solution reads through from the beginning for each index, therefore the complexity is O(N**2) in terms of file size.
Sorting the indexes so I could go through the file "once" while seeking from the current position with something like that:
binaryFile_new = open(binary_folder_path, "r+b") sorted_index_list = sorted(index_list) for i, index in enumerate(sorted_index_list): if i == 0: binaryFile_new.seek (size * (v), 0) else: binaryFile_new.seek ((index - sorted_index_list[i-1]) * size - size, 1) binaryFile_new.seek (size * (index), 0) wanted_line = binaryFile_new.read (size) wanted_line_list.append(wanted_line) binaryFile_new.close()
I expected the second solution to be much faster because in theory it would go through the whole file once O(N).
But for some reason both solutions run the same.
I also have a hard constraint on memory usage, as I run this operation in parallel and on many files, so I can't read files into memory.
Maybe the mmap
package will help? Though, I think mmap
also scans the entire file until it gets to the index so it's not "true" random access.