
I have a binary file consisting of a header followed by a list of records. Each record is a list of n unsigned integers followed by a list of n signed integers, where n varies from record to record. The unsigned integers are values for a variable x, and the signed integers are values for a variable y. For instance, the data might look like this:

4 3 1 5 -10 50 40 30
4 5 6 30 -25 100
5 4 30 55

6 4 6 7 4 -20 30 30 -50 60

I need to read this file and store each record into a (n, 2) numpy array where the first column contains the x values and the second column contains the y values. Then, I want to store all the arrays into a container numpy array. In the end, I should have something like

np.array([
    np.array([[4, -10], [3, 50], [1, 40], [5, 30]]),
    np.array([[4, 30], [5, -25], [6, 100]]),
    np.array([[5, 30], [4, 55]]),
    np.empty((0, 2)),
    np.array([[6, -20], [4, 30], [6, 30], [7, -50], [4, 60]])
])

The binary file is about 200 MB and contains about 10 million records. The length of each record is provided in the header, and I have it in a lengths array. Note that some records might be empty, indicated by a length of 0.
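Since the lengths are known up front, the byte offset of every record can be computed in advance. Here is a small sketch of that bookkeeping, assuming (as in my file) that both the unsigned and signed values are 2 bytes each; the `lengths` values below are just illustrative:

import numpy as np

# Hypothetical lengths array, as read from the header
lengths = np.array([4, 3, 2, 0, 5])

item_size = 2  # assumption: 2-byte x values and 2-byte y values
record_bytes = lengths * 2 * item_size  # n unsigned + n signed per record
offsets = np.concatenate(([0], np.cumsum(record_bytes)[:-1]))
total_bytes = record_bytes.sum()

With the example lengths, `record_bytes` is `[16, 12, 8, 0, 20]` and `offsets` is `[0, 16, 28, 36, 36]` (the empty record occupies no bytes, so it shares its offset with the next record).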

Right now, I have a working solution using struct.unpack, but it is really slow (reading the file takes a couple of minutes). I iterate over the lengths array and call the read_single_record function for each record.

import struct

import numpy as np


def read_single_record(f, length, x_fmt, y_fmt):
    """x_fmt and y_fmt are tuples of
    (struct format character, size in bytes).
    """
    raw = f.read(length)
    n_vals = length // (x_fmt[1] + y_fmt[1])
    val = struct.unpack(
        f'{n_vals}{x_fmt[0]}{n_vals}{y_fmt[0]}', raw)
    val = np.array(val, dtype=np.int16).reshape((n_vals, 2), order='F')

    # do some more data processing
    # ...

    return val

records = np.empty(nb_records, dtype='object')
for i, length in enumerate(lengths):
    records[i] = read_single_record(f, length, x_fmt, y_fmt)

I have seen others suggest using np.fromfile, but I am not quite sure how to use it with records of variable length.
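One direction I have been considering is a single bulk read of the payload followed by per-record slicing, instead of 10 million small f.read calls. The sketch below assumes x values are little-endian uint16 and y values are little-endian int16 (my actual formats may differ); the function name read_all_records is just something I made up:

import numpy as np

def read_all_records(f, lengths):
    """Read every record in one np.fromfile call, then slice.

    Assumes x values are '<u2' and y values are '<i2', and that
    `lengths` gives n (the number of x/y pairs) for each record.
    """
    raw = np.fromfile(f, dtype=np.uint8)   # one bulk read of the payload
    bounds = np.cumsum(lengths * 4)        # 4 bytes per (x, y) pair
    records = np.empty(len(lengths), dtype=object)
    start = 0
    for i, (n, end) in enumerate(zip(lengths, bounds)):
        chunk = raw[start:end]
        # Reinterpret the raw bytes without copying, then widen so the
        # unsigned x and signed y columns share a common dtype.
        x = chunk[:2 * n].view('<u2').astype(np.int32)
        y = chunk[2 * n:].view('<i2').astype(np.int32)
        records[i] = np.stack([x, y], axis=1)  # shape (n, 2), (0, 2) if empty
        start = end
    return records

I am not sure whether this is the intended way to combine np.fromfile with variable-length records, but it at least replaces millions of small reads and struct.unpack calls with one read plus cheap views.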

What is the most efficient way to read records of variable length from my binary file?

Loïc Séguin-C.
  • Don't use a `numpy` array to store jagged arrays, use a different data structure – user3483203 Sep 27 '18 at 15:59
  • I doubt if the object array has any advantages over a plain list. But the real issue is whether `fromfile` can handle the unpacking any better. You may be able to give it a compound `dtype` that is functionally equivalent. But since a record consists of a mix of variable length unsigned and signed, I don't see how you can avoid reading one record at a time. – hpaulj Sep 27 '18 at 16:24
  • In response to @user3483203, the reason I used a numpy array to store the jagged arrays is that it provides some useful functionality later on in my code. For this question though, it is not really important and the container could very well be a list. – Loïc Séguin-C. Sep 27 '18 at 16:37
