I have hundreds of gigabytes of data in binary files. I want to take a random sample of the data by reading several consecutive records from a random position, many times over.
The data is stored in many files. The main files do not store the data in any particular order, so each one has a sorted index file. My current code is something like this, except that there are many files:
import random
import struct

index = open("foo.index", 'rb')
data = open("foo", 'rb')
index_offset_format = 'Q'
index_offset_size = struct.calcsize(index_offset_format)

record_set = []
for _ in range(n_batches):
    # Read `batch_size` offsets from the index - these are consecutive,
    # so they can be read in one operation
    index_offset_start = random.randint(0, N_RECORDS - batch_size)
    index.seek(index_offset_start * index_offset_size)
    data_offsets = struct.iter_unpack(
        index_offset_format,
        index.read(index_offset_size * batch_size))
    # Read the actual records from the data file. These are not consecutive.
    records = []
    for offset, in data_offsets:  # iter_unpack yields 1-tuples
        data.seek(offset)
        records.append(data.read(RECORD_SIZE))
    record_set.append(records)
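For reference, this is roughly the file layout I am assuming: fixed-size records in foo, and foo.index holding one unsigned 64-bit byte offset per record, ordered by the records' sort key rather than by file position. The generator below is a made-up toy just to make the example above runnable; RECORD_SIZE, N_RECORDS and the shuffle are placeholders, not my real data:

import random
import struct

RECORD_SIZE = 128    # placeholder fixed record size in bytes
N_RECORDS = 10_000   # placeholder record count

# Toy data file: fixed-size records stored in no particular order.
with open("foo", 'wb') as data:
    for i in range(N_RECORDS):
        data.write(i.to_bytes(8, 'little') * (RECORD_SIZE // 8))

# Toy index file: one 'Q' byte offset per record. The shuffle stands in
# for "sorted by key", which does not match the data file's order.
offsets = [i * RECORD_SIZE for i in range(N_RECORDS)]
random.shuffle(offsets)
with open("foo.index", 'wb') as index:
    for offset in offsets:
        index.write(struct.pack('Q', offset))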
Then other things are done with the records. From profiling I can see that the program is heavily IO-bound, and that most of the time is spent in index.read and data.read. I suspect this is because read is blocking: the interpreter waits for the OS to read the data from disk before asking for the next random chunk, so the OS has no opportunity to optimise the disk access pattern. So: is there some file API that I can pass a batch of read requests to? Something like:
def read_many(file, offsets, lengths):
    '''
    @param file: the file to read from
    @param offsets: the offsets to seek to
    @param lengths: the lengths of data to read
    @return an iterable over the file contents at the requested offsets
    '''
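To be concrete about the semantics I'm after: on Unix I can write the single-call-per-offset version of this with os.pread, which reads at a given offset without moving the file position. But as far as I can tell each call still blocks before the next one is issued, so the kernel never sees the whole batch at once. A naive sketch:

import os

def read_many(file, offsets, lengths):
    # Naive stand-in using positional reads (Unix only). Each os.pread
    # still blocks until it returns, so the reads are not actually batched.
    fd = file.fileno()
    return [os.pread(fd, length, offset)
            for offset, length in zip(offsets, lengths)]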
Alternatively, would it be enough to open several file objects and request multiple reads using multithreading? Or would the GIL prevent that from being useful?
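What I have in mind for the multithreaded version is something like the sketch below (the function and its parameters are made up for illustration). My understanding is that blocking file reads such as os.pread release the GIL while waiting on the OS, so the threads should be able to overlap their IO, but I don't know whether that actually lets the OS reorder the accesses or whether the pool overhead eats the benefit:

import os
from concurrent.futures import ThreadPoolExecutor

def read_records_threaded(path, offsets, record_size, max_workers=8):
    # Hypothetical sketch: issue one batch of record reads from a thread
    # pool, in the hope that the kernel can service them concurrently.
    fd = os.open(path, os.O_RDONLY)
    try:
        def read_one(offset):
            # os.pread does not touch a shared file position, so the
            # threads can safely share a single file descriptor.
            return os.pread(fd, record_size, offset)

        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(read_one, offsets))
    finally:
        os.close(fd)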