I have a binary file consisting of a header followed by a list of records. Each record is a list of n unsigned integers followed by a list of n signed integers, where n varies from record to record. The unsigned integers are values for a variable x, and the signed integers are values for a variable y. For instance, the data might look like this (one record per line, x values first, then y values):
4 3 1 5 -10 50 40 30
4 5 6 30 -25 100
5 4 30 55
6 4 6 7 4 -20 30 30 -50 60
I need to read this file and store each record in an (n, 2) NumPy array where the first column contains the x values and the second column contains the y values. Then, I want to store all the arrays in a container NumPy array. In the end, I should have something like:
np.array([
    np.array([[4, -10], [3, 50], [1, 40], [5, 30]]),
    np.array([[4, 30], [5, -25], [6, 100]]),
    np.array([[5, 30], [4, 55]]),
    np.empty((0, 2)),  # empty record
    np.array([[6, -20], [4, 30], [6, 30], [7, -50], [4, 60]])
])
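For reference, here is a minimal runnable sketch of that target structure. Calling np.array directly on sub-arrays of different lengths can raise an error or behave differently depending on the NumPy version, so this sketch fills a pre-allocated object array instead. The names record_list and container are only illustrative.

```python
import numpy as np

# The example records from above, including one empty record
record_list = [
    np.array([[4, -10], [3, 50], [1, 40], [5, 30]]),
    np.array([[4, 30], [5, -25], [6, 100]]),
    np.array([[5, 30], [4, 55]]),
    np.empty((0, 2), dtype=int),
    np.array([[6, -20], [4, 30], [6, 30], [7, -50], [4, 60]]),
]

# Pre-allocate an object array and assign element by element,
# so NumPy never tries to merge the ragged records into one block
container = np.empty(len(record_list), dtype=object)
for i, rec in enumerate(record_list):
    container[i] = rec
```

Each element of container is then an independent (n, 2) array, and the empty record keeps its (0, 2) shape.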
The binary file is about 200 MB and contains about 10 million records. The length of each record (in bytes) is provided in the header data, and I have it in a lengths array. Note that some records might be empty, indicated by a length of 0.
Right now, I have a working solution using struct.unpack, but it is really slow (reading the file takes a couple of minutes). I iterate over the lengths array and call the read_single_record function for each entry.
import struct

import numpy as np


def read_single_record(f, length, x_fmt, y_fmt):
    """Read one record of `length` bytes from the open binary file `f`.

    x_fmt and y_fmt are tuples of
    (struct format character, size in bytes).
    """
    raw = f.read(length)
    n_vals = length // (x_fmt[1] + y_fmt[1])
    # n_vals x values followed by n_vals y values
    val = struct.unpack(
        f'{n_vals}' + x_fmt[0] + f'{n_vals}' + y_fmt[0], raw)
    # order='F' puts the first n_vals values (x) in column 0 and the rest (y) in column 1
    val = np.array(val, dtype=np.int16).reshape((n_vals, 2), order='F')
    # do some more data processing
    # ...
    return val


records = np.empty(nb_records, dtype='object')
for i, length in enumerate(lengths):
    records[i] = read_single_record(f, length, x_fmt, y_fmt)
I have seen others suggest using np.fromfile, but I am not quite sure how to use it with records of variable length.
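To make the question concrete, one possible shape of such a read is sketched below. It is only a sketch under stated assumptions, not a tested solution for the real file: x is assumed to be little-endian uint16 and y little-endian int16 (so both have the same item size), the header has already been skipped, and the payload contains nothing but record data. The function name split_records and its parameters are hypothetical. The idea is to load all values as one flat array, compute the positions of every x and y value from the lengths array with index arithmetic, and split the result once at the end.

```python
import numpy as np


def split_records(payload, lengths, x_dtype="<u2", y_dtype="<i2"):
    """Return a list of (n, 2) arrays, one per record.

    payload: bytes holding all records back to back (header already skipped).
    lengths: record sizes in bytes, as read from the header.
    Assumption: x and y values have the same item size.
    """
    x_dtype, y_dtype = np.dtype(x_dtype), np.dtype(y_dtype)
    assert x_dtype.itemsize == y_dtype.itemsize
    lengths = np.asarray(lengths, dtype=np.int64)

    # Number of (x, y) pairs in each record
    n = lengths // (x_dtype.itemsize + y_dtype.itemsize)
    ends = np.cumsum(n)
    pair_start = ends - n          # index of each record's first pair
    total = int(n.sum())

    # All values of the file as one flat array of x-typed items
    flat = np.frombuffer(payload, dtype=x_dtype)

    # Position of every pair inside its own record: 0, 1, ..., n-1
    within = np.arange(total) - np.repeat(pair_start, n)
    # Each record occupies 2*n items, so it starts at item 2*pair_start
    x_idx = np.repeat(2 * pair_start, n) + within
    # The y block of a record starts n items after its x block
    y_idx = x_idx + np.repeat(n, n)

    # int32 holds both the uint16 and the int16 range without overflow
    out = np.empty((total, 2), dtype=np.int32)
    out[:, 0] = flat[x_idx]
    out[:, 1] = flat.view(y_dtype)[y_idx]

    # Cut the stacked pairs back into one array per record;
    # records with n == 0 come out with shape (0, 2)
    return np.split(out, ends[:-1])
```

With a real file, the flat array could presumably come from np.fromfile(f, dtype="<u2") once the file position is past the header, instead of np.frombuffer on bytes. The final np.split still creates one view per record, so it is unclear how much of the total time it would account for with 10 million records.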
What is the most efficient way to read records of variable length from my binary file?