0

I have large binary data files that have a predefined format, originally written by a Fortran program in little-endian byte order. I would like to read these files in the fastest, most efficient manner, so using the array package seemed right up my alley, as suggested in Improve speed of reading and converting from binary file?.

The problem is that the predefined format is non-homogeneous. It looks something like this: ['<2i','<5d','<2i','<d','<i','<3d','<2i','<3d','<i','<d','<i','<3d']

with each integer i taking up 4 bytes, and each double d taking 8 bytes.

Is there a way I can still use the super efficient array package (or another suggestion) but with the right format?

martineau
boof
  • What do you call "large binary data files"? Do you mean that the data in your original file are not all formatted the same way? How do you know the format of each one? – Dadep Jul 05 '17 at 18:42
  • The binary files contain anywhere between ~1,000 and ~1,000,000 lines of data each from a physical simulation. For testing purposes, I am using files that contain only about 40,000 lines of data. I know the format of each one because I have the original Fortran code and can see what type and how large each piece of data is in memory. – boof Jul 05 '17 at 19:07
  • What are `<5d` and `<3d`? When you say *"... each double `d` taking 8 bytes"*, what do you mean by "double `d`"? Is each `d` 4 bytes? Please clarify. – AGN Gazer Jul 05 '17 at 19:32
  • What do you mean by "lines" of data in a binary file? Is the file essentially a collection of structures, each in a predefined format such as the one shown in your question? – martineau Jul 06 '17 at 18:34

5 Answers

4

Use struct. In particular, struct.unpack.

result = struct.unpack("<2i5d...", buffer)

Here buffer holds the given binary data.
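
For the record layout described in the question, the individual pieces can be combined into a single format string and precompiled with struct.Struct, which avoids reparsing the format on every call. A minimal sketch, assuming the file is a sequence of such fixed-size records (the filename is a placeholder):

import struct

# 2i 5d 2i d i 3d 2i 3d i d i 3d  ->  9 ints + 16 doubles per record
record = struct.Struct("<2i5d2idi3d2i3didi3d")

with open('input.bin', 'rb') as f:      # placeholder filename
    raw = f.read(record.size)           # read one fixed-size record (164 bytes)
    values = record.unpack(raw)         # tuple of 9 ints and 16 doubles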

ForceBru
  • I took a look at using struct, and that's what I originally fell back on, but I read that the array package is faster. Was I misled? – boof Jul 05 '17 at 19:03
  • 2
    @boof, `array.array` holds a sequence of elements _of the same type_, so processing data `struct`ured like this at once is impossible with it. – ForceBru Jul 05 '17 at 19:06
4

It's not clear from your question whether you're concerned about the actual file reading speed (and building the data structure in memory), or about later data processing speed.

If you are reading only once and doing heavy processing later, you can read the file record by record (if your binary data is a set of repeated records with identical format), parse each record with struct.unpack, and append the values to an array of doubles:

import array
import struct
from functools import partial

data = array.array('d')
record_size_in_bytes = 9*4 + 16*8   # 9 ints + 16 doubles per record

with open('input', 'rb') as fin:
    for record in iter(partial(fin.read, record_size_in_bytes), b''):
        values = struct.unpack("<2i5d...", record)   # full format as given in the question
        data.extend(values)

This is under the assumption that you are allowed to cast all your ints to doubles and are willing to accept the increase in allocated memory size (a 22% increase for the record from your question: 25 values stored as doubles take 25*8 = 200 bytes, versus 9*4 + 16*8 = 164 bytes in the original layout).

If you are reading the data from the file many times, it could be worthwhile to convert everything to one large array of doubles (as above) and write it back to another file, from which you can later read it with array.fromfile():

import array
import os

data = array.array('d')
with open('preprocessed', 'rb') as fin:
    n = os.fstat(fin.fileno()).st_size // 8   # number of doubles in the file
    data.fromfile(fin, n)
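
The preprocessing step itself (writing the homogeneous array out so it can later be loaded with array.fromfile()) is not shown above; a minimal sketch, assuming data is the array.array('d') built by the first snippet:

with open('preprocessed', 'wb') as fout:
    data.tofile(fout)   # dump the raw doubles; array.fromfile() can read them back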

Update. Thanks to a nice benchmark by @martineau, we now know for a fact that preprocessing the data and turning it into a homogeneous array of doubles makes loading such data from file (with array.fromfile()) ~20x to ~40x faster than reading it record by record, unpacking, and appending to an array (as shown in the first code listing above).

A faster (and more standard) variation of record-by-record reading in @martineau's answer, which appends to a list and doesn't upcast to double, is only ~6x to ~10x slower than the array.fromfile() method and seems like a better reference benchmark.

randomir
  • Casting all the `int`s into `double`s would increase the amount of memory needed which might be important if the file is large and can't be processed a line-at-a-time—although I'm not sure what a "line" is in a binary file. – martineau Jul 06 '17 at 18:05
  • 1
    A 22% increase in this case, yes. Also, on second thought I was interpreting "lines" from the OP too literal. It makes much more sense to assume a fixed-size record, and a file as a sequence of such records (non-newline delimited). I've edited my answer to reflect this. – randomir Jul 06 '17 at 20:16
  • Besides using more memory and possibly requiring a preprocessing step, your approach of using `array` does _not_ appear to be faster than using `struct.unpack()`, as suggested by @ForceBru. See the [answer I posted](https://stackoverflow.com/a/45019271/355230) for details. – martineau Jul 10 '17 at 18:45
  • But you are not using the preprocessed file in your `using_preprocessed_file`. Can you fix it? I'm really curious whether `array` is as efficient as the OP said. – randomir Jul 10 '17 at 19:13
  • You're right, good catch...although I should have caught something so obvious myself `:-(`. I've corrected the issue, and it does indeed appear that preprocessing the data and turning it into a homogeneous `array` of doubles _is_ in fact the fastest way to do things with stock Python. Immensely sorry for the confusion (and the earlier bogus comment). P.S. You need to change the one line to `n = os.fstat(fin.fileno()).st_size // 8` so it works in Python 3 as well as 2. – martineau Jul 10 '17 at 20:08
  • @martineau, as they say, evidence is the king. I'm glad you did this, thanks. I'll update my answer with your results. – randomir Jul 10 '17 at 20:31
  • Some clarification, for the record: reading a preprocessed `array`-like file is ~20x to ~40x faster than reading the original file record by record, converting each chunk into an `array`, and tacking it onto the bigger array being built. The speed difference is not nearly that much when compared to reading it in all-at-once or piecemeal and then converting either of those into a `list` of `tuples`, one per record, which is what I would consider to be the "standard" or common way of doing it (and which allows individual `tuples` to be quickly accessed by index). – martineau Jul 10 '17 at 21:47
  • 1
    You're right, I believe that was OP's initial approach. I've updated my update. :) – randomir Jul 10 '17 at 23:08
3

Major Update: Modified to use proper code for reading in a preprocessed array file (function using_preprocessed_file() below), which dramatically changed the results.

To determine what method is faster in Python (using only built-ins and the standard libraries), I created a script to benchmark (via timeit) the different techniques that could be used to do this. It's a bit on the longish side, so to avoid distraction, I'm only posting the code tested and related results. (If there's sufficient interest in the methodology, I'll post the whole script.)

Here are the snippets of code that were compared:

@TESTCASE('Read and construct piecemeal with struct')
def read_file_piecemeal():
    structures = []
    with open(test_filenames[0], 'rb') as inp:
        size = fmt1.size
        while True:
            buffer = inp.read(size)
            if len(buffer) != size:  # EOF?
                break
            structures.append(fmt1.unpack(buffer))
    return structures

@TESTCASE('Read all-at-once, then slice and struct')
def read_entire_file():
    offset, unpack, size = 0, fmt1.unpack, fmt1.size
    structures = []
    with open(test_filenames[0], 'rb') as inp:
        buffer = inp.read()  # read entire file
        while True:
            chunk = buffer[offset: offset+size]
            if len(chunk) != size:  # EOF?
                break
            structures.append(unpack(chunk))
            offset += size

    return structures

@TESTCASE('Convert to array (@randomir part 1)')
def convert_to_array():
    data = array.array('d')
    record_size_in_bytes = 9*4 + 16*8   # 9 ints + 16 doubles (standard sizes)

    with open(test_filenames[0], 'rb') as fin:
        for record in iter(partial(fin.read, record_size_in_bytes), b''):
            values = struct.unpack("<2i5d2idi3d2i3didi3d", record)
            data.extend(values)

    return data

@TESTCASE('Read array file (@randomir part 2)', setup='create_preprocessed_file')
def using_preprocessed_file():
    data = array.array('d')
    with open(test_filenames[1], 'rb') as fin:
        n = os.fstat(fin.fileno()).st_size // 8
        data.fromfile(fin, n)
    return data

def create_preprocessed_file():
    """ Save array created by convert_to_array() into a separate test file. """
    test_filename = test_filenames[1]
    if not os.path.isfile(test_filename):  # doesn't already exist?
        data = convert_to_array()
        with open(test_filename, 'wb') as file:
            data.tofile(file)
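
For context, the @TESTCASE decorator and the timing harness are not part of the posted snippets; a minimal sketch of what such a harness could look like (purely illustrative, the names and details here are assumptions):

import timeit

test_cases = []   # (description, function, optional setup-function name)

def TESTCASE(description, setup=None):
    """ Hypothetical decorator: register a function to be benchmarked. """
    def decorator(func):
        test_cases.append((description, func, setup))
        return func
    return decorator

def run_benchmarks(executions=10, repetitions=3):
    for description, func, setup in test_cases:
        if setup:
            globals()[setup]()   # e.g. create the preprocessed test file first
        best = min(timeit.repeat(func, number=executions, repeat=repetitions))
        print('{}: {:.5f} secs'.format(description, best))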

And here were the results running them on my system:

Fastest to slowest execution speeds using Python 3.6.1
(10 executions, best of 3 repetitions)
Size of structure: 164
Number of structures in test file: 40,000
file size: 6,560,000 bytes

      Read array file (@randomir part 2): 0.06430 secs, relative  1.00x (   0.00% slower)
 Read all-at-once, then slice and struct: 0.39634 secs, relative  6.16x ( 516.36% slower)
Read and construct piecemeal with struct: 0.43283 secs, relative  6.73x ( 573.09% slower)
     Convert to array (@randomir part 1): 1.38310 secs, relative 21.51x (2050.87% slower)

Interestingly, most of the snippets are actually faster in Python 2...

Fastest to slowest execution speeds using Python 2.7.13
(10 executions, best of 3 repetitions)
Size of structure: 164
Number of structures in test file: 40,000
file size: 6,560,000 bytes

      Read array file (@randomir part 2): 0.03586 secs, relative  1.00x (   0.00% slower)
 Read all-at-once, then slice and struct: 0.27871 secs, relative  7.77x ( 677.17% slower)
Read and construct piecemeal with struct: 0.40804 secs, relative 11.38x (1037.81% slower)
     Convert to array (@randomir part 1): 1.45830 secs, relative 40.66x (3966.41% slower)
martineau
0

Take a look at the documentation for numpy's fromfile function: https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.fromfile.html and https://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html#arrays-dtypes-constructing

Simplest example:

import numpy as np
data = np.fromfile('binary_file', dtype=np.dtype('<i8, ...'))

Read more about "Structured Arrays" in numpy and how to specify their data type(s) here: https://docs.scipy.org/doc/numpy/user/basics.rec.html#

AGN Gazer
0

There are a lot of good and helpful answers here, but I think the best solution deserves a fuller explanation. I implemented a method that reads the entire data file in one pass using the built-in read() and constructs a numpy ndarray at the same time. This is more efficient than reading the data and constructing the array separately, but it's also a bit more finicky.

import numpy as np

line_cols = 20              # for example
line_rows = 40000           # for example
data_fmt = 15*'f8,'+5*'f4,' # for example (15 8-byte doubles + 5 4-byte floats)
data_bsize = 15*8 + 5*4     # bytes per record, for example

with open(filename, 'rb') as f:   # filename: path to the binary data file
    data = np.ndarray(shape=(1, line_rows),
                      dtype=np.dtype(data_fmt),
                      buffer=f.read(line_rows*data_bsize))[0].astype(line_cols*'f8,').view(dtype='f8').reshape(line_rows, line_cols)[:, :-1]

Here, we open the file in binary mode using the 'rb' option of open. Then we construct our ndarray with the proper shape and dtype to fit our read buffer. We then reduce the ndarray to a 1D array by taking its zeroth index, which is where all our data is hiding. Finally, we convert and reshape the array using the astype, view and reshape ndarray methods; the astype/view step is needed because reshape doesn't like data with mixed dtypes, and I'm okay with having my integers expressed as doubles.

This method is ~100x faster than looping line-for-line through the data, and could potentially be compressed down into a single line of code.

In the future, I may try to read the data in even faster using a Fortran script that essentially converts the binary file into a text file. I don't know if this will be faster, but it may be worth a try.

boof