
Background

I am analyzing large (between 0.5 and 20 GB) binary files, which contain information about particle collisions from a simulation. The number of collisions and the numbers of incoming and outgoing particles can vary, so the files consist of variable-length records. For analysis I use Python and numpy. After switching from Python 2 to Python 3 I noticed a dramatic decrease in the performance of my scripts and traced it down to the numpy.fromfile function.

Simplified code to reproduce the problem

This code, iotest.py

  1. Generates a file of a similar structure to what I have in my studies
  2. Reads it using numpy.fromfile
  3. Reads it using numpy.frombuffer
  4. Compares timing of both

    import numpy as np
    import os
    
    def generate_binary_file(filename, nrecords):
        n_records = np.random.poisson(lam = nrecords)
        record_lengths = np.random.poisson(lam = 10, size = n_records).astype(dtype = 'i4')
        x = np.random.normal(size = record_lengths.sum()).astype(dtype = 'd')
        with open(filename, 'wb') as f:
            s = 0
            for i in range(n_records):
                f.write(record_lengths[i].tobytes())
                f.write(x[s:s+record_lengths[i]].tobytes())
                s += record_lengths[i]
            # Trick for testing: make sum of records equal to 0
            f.write(np.array([1], dtype = 'i4').tobytes())
            f.write(np.array([-x.sum()], dtype = 'd').tobytes())
        return os.path.getsize(filename)
    
    def read_binary_npfromfile(filename):
        checksum = 0.0
        with open(filename, 'rb') as f:
            while True:
                try:
                    record_length = np.fromfile(f, 'i4', 1)[0]
                    x = np.fromfile(f, 'd', record_length)
                    checksum += x.sum()
                except:
                    break
        assert(np.abs(checksum) < 1e-6)

    def read_binary_npfrombuffer(filename):
        checksum = 0.0
        with open(filename, 'rb') as f:
            while True:
                try:
                    record_length = np.frombuffer(f.read(np.dtype('i4').itemsize), dtype = 'i4', count = 1)[0]
                    x = np.frombuffer(f.read(np.dtype('d').itemsize * record_length), dtype = 'd', count = record_length)
                    checksum += x.sum()
                except:
                    break
        assert(np.abs(checksum) < 1e-6)
    
    
    if __name__ == '__main__':
        from timeit import Timer
        from functools import partial
    
        fname = 'testfile.tmp'
        print("# File size[MB], Timings and errors [s]: fromfile, frombuffer")
        for i in [10**3, 3*10**3, 10**4, 3*10**4, 10**5, 3*10**5, 10**6, 3*10**6]:
            fsize = generate_binary_file(fname, i)
            t1 = Timer(partial(read_binary_npfromfile, fname))
            t2 = Timer(partial(read_binary_npfrombuffer, fname))
            a1 = np.array(t1.repeat(5, 1))
            a2 = np.array(t2.repeat(5, 1))
            print('%8.3f %12.6f %12.6f %12.6f %12.6f' % (1.0 * fsize / (2**20), a1.mean(), a1.std(), a2.mean(), a2.std()))

Results

[Plot: Timing of numpy.fromfile and numpy.frombuffer with Python 2 and with Python 3]

[Plot: Slowdown in Python 3 compared to Python 2]

Conclusions

In Python 2 numpy.fromfile was probably the fastest way to deal with binary files of variable structure. It was approximately 3 times faster than numpy.frombuffer. Performance of both scaled linearly with file size.

In Python 3 numpy.frombuffer became around 10% slower, while numpy.fromfile became around 9.3 times slower compared to Python 2! Performance of both still scales linearly with file size.

In the documentation numpy.fromfile is described as "A highly efficient way of reading binary data with a known data-type". This is no longer true in Python 3. In fact, other people have already noticed this.

Questions

  1. In Python 3 how to obtain a comparable (or better) performance to Python 2, when reading binary files of variable structure?
  2. What happened in Python 3 so that numpy.fromfile became an order of magnitude slower?
  • There is a somewhat related question about numpy fromfile versus frombuffer https://stackoverflow.com/questions/52165652/reading-a-structured-binary-file-with-numpy-fromfile-vs-read-frombuffer. Turns out my little research gives an answer to it. – Dmytro Oliinychenko Mar 09 '22 at 15:35
  • What has changed from Python 2 to Python 3 is the buffering policy, see [`open` (2.7)](https://docs.python.org/2/library/functions.html#open) vs. [`open` (3.10)](https://docs.python.org/3/library/functions.html#open). So I thought it might help to set `buffering=os.path.getsize(filename)` but that didn't change anything for me. What helped on the other hand was to read the entire file into a `io.BytesIO()` object and then use `np.fromfile` on that object. This gave similar performance for both 2.7 and 3.9. – a_guest Mar 09 '22 at 20:27
  • Also note that on Python 2, the object returned by `open` is a thin wrapper around C-level functions such as `fread` and `fwrite` while on Python 3 this is hidden behind the more complex [`io.Buffered(Reader|Writer)`](https://docs.python.org/3/library/io.html#io.BufferedReader) implementation, though I can't tell how much extra work they'll do in this case. – a_guest Mar 09 '22 at 20:27
  • You can use `strace --trace=read python test.py` in order to see the read system calls and on my system with Python 3 it performed *a lot* more read calls than with Python 2. I'm not sure what causes this though. This happens even with `buffering=os.path.getsize(filename)` so Python 3 still seems to fill that buffer differently, i.e. not reading the whole file at once but somehow in chunks. This seems to agree with the results when using `io.BytesIO` where the program is forced to read the entire file in one go and there is no performance difference between 2.7 and 3.9. – a_guest Mar 09 '22 at 20:46
  • @a_guest How did you use np.fromfile on io.BytesIO()? I tried and got an error "io.UnsupportedOperation: fileno" It looks like the corresponding numpy issue [ https://github.com/numpy/numpy/issues/2230 ] is hanging since 2012. – Dmytro Oliinychenko Mar 09 '22 at 22:00
  • It worked for me without issue, I just did `b = io.BytesIO(); b.write(f.read()); b.seek(0); f = b;` and then proceed with the rest of the code. I tested with Numpy 1.16.6 which seems to be the latest version available for Python 2.7. – a_guest Mar 09 '22 at 22:47
  • Well, I just realized that actually it didn't work, the `UnsupportedOperation` exception was just consumed by the bare `except` clause in the test functions... – a_guest Mar 09 '22 at 22:54
  • Anyway, I checked number of `read` calls via `strace` for both Python versions and `np.fromfile` as well as `np.frombuffer` (subtracting number of `read` calls originating only from imports). This is what I get: Python 2.7: `read_binary_npfromfile: 2067, read_binary_npfrombuffer: 2067`; Python 3.9: `read_binary_npfromfile: 399984, read_binary_npfrombuffer: 2068`. I use a test file with `nrecords=10**5` (fixed `np.random.seed(0)`); when I call `os.path.getsize(filename)` I get `8406416`. – a_guest Mar 09 '22 at 23:12
  • I subclassed `io.BytesIO` to resolve the `fileno` issue; here I just added a method `def fileno(self): return self._fileno` where `self._fileno = f.fileno()`, i.e. it is set to the actual file descriptor. Then I also log all attribute access to that new BytesIO object and I realized it performs a lot of `seek` and `tell` calls, namely `200089`; it performs twice as many (`400178`) accesses to `fileno`. There is not a single access to `read` so it seems that `np.fromfile` uses directly a low-level function on the file descriptor. That would explain why increasing buffering has no effect. – a_guest Mar 10 '22 at 08:37
  • Comparing these numbers a pattern seems to appear. There are `1e5` records in the file, i.e. there will be `2e5` queries to `np.fromfile`. This can explain the number of `tell`s which are probably used to limit the max. number of elements read (similar to [this](https://github.com/numpy/numpy/blob/v1.16.6/numpy/core/records.py#L728)). This would also use half of the `fileno` accesses for `fstat` and the other half is used for doing reading at a low level. Since Python 2.7 does its buffering also at low-level, this translates into a reasonable amount of `read` system calls (~4k bytes each) ... – a_guest Mar 10 '22 at 10:03
  • ... For Python 3.x, buffering was shifted to a higher level, namely the `io.BufferedReader` class, so when file reading is performed on low-level using file descriptors, this probably circumvents the buffering. However, in that case I would expect also `2e5` read system calls however it appears there are twice as many (why?). I further checked the arguments to `seek` and computed their pairwise difference. I realized that the resulting sequence was `[4, L[0], 4, L[1], 4, L[2], 4, ...]` where `L` is the `record_lengths` in bytes ... – a_guest Mar 10 '22 at 10:04
  • ... I.e. for each read (alternating `i4` and `d`) it seems to perform an additional `seek`. So this could hint at further interference of the low-level access with the high-level buffering. When I inspect the read system calls, they are alternating in size between `4096` (presumably the buffer chunk size) and some other values. – a_guest Mar 10 '22 at 10:04

1 Answer


TL;DR: np.fromfile and np.frombuffer are not optimized to read many small buffers. You can load the whole file in a big buffer and then decode it very efficiently using Numba.


Analysis

The main issue is that the benchmark measures overheads. Indeed, it performs a lot of system/C calls that are very inefficient. For example, on the 24 MiB file, the while loops call np.fromfile and np.frombuffer 601_214 times. The timings on my machine are 10.5 s for read_binary_npfromfile and 1.2 s for read_binary_npfrombuffer. This means respectively 17.4 us and 2.0 us per call for the two functions. Such timings per call are relatively reasonable considering Numpy is not designed to operate efficiently on very small arrays (it needs to perform many checks, call some functions, wrap/unwrap CPython types, allocate some objects, etc.). The overhead of these functions can change from one version to another and, unless it becomes huge, this is not a bug. The addition of new features to Numpy and CPython often impacts overheads, and this appears to be the case here (e.g. the buffering interface). The point is that this is not really a problem, because there is a different approach that is much, much faster (as it does not pay these huge overheads).
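
To see this per-call overhead in isolation, one can time a single tiny read with each function. The snippet below is only a rough sketch (the scratch file name is arbitrary and the absolute numbers depend on the machine and on the Numpy/CPython versions):

import timeit
import numpy as np

# Small scratch file so that both functions read identical 4-byte values.
np.arange(1000, dtype='i4').tofile('overhead_probe.tmp')
f = open('overhead_probe.tmp', 'rb')

def one_fromfile():
    f.seek(0)                          # rewind so every call performs one tiny read
    return np.fromfile(f, 'i4', 1)[0]

def one_frombuffer():
    f.seek(0)
    return np.frombuffer(f.read(4), dtype='i4', count=1)[0]

n = 100_000
print('fromfile  : %.2f us/call' % (timeit.timeit(one_fromfile, number=n) / n * 1e6))
print('frombuffer: %.2f us/call' % (timeit.timeit(one_frombuffer, number=n) / n * 1e6))
f.close()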


Faster Numpy code

The main solution for writing a fast implementation is to read the whole file once into a big byte buffer and then decode it using ndarray.view. That being said, this is a bit tricky because of data alignment and because nearly all Numpy functions have to be kept out of the while loop due to their overhead. Here is an example:

def read_binary_faster_numpy(filename):
    buff = np.fromfile(filename, dtype=np.uint8)
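    # Two float64 views shifted by 4 bytes: a record's doubles start either on an
    # 8-byte boundary or 4 bytes past one, depending on the parity of pos below.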
    buff_int32 = buff.view(np.int32)
    buff_double_1 = buff[0:len(buff)//8*8].view(np.float64)
    buff_double_2 = buff[4:4+(len(buff)-4)//8*8].view(np.float64)
    nblocks = buff.size // 4      # Number of 4-byte blocks
    pos = 0                       # Displacement by block of 4 bytes
    lst = []
    while pos < nblocks:
        record_length = buff_int32[pos]
        pos += 1
        if pos + record_length * 2 > nblocks:
            break
        offset = pos // 2
        if pos % 2 == 0:          # Aligned with buff_double_1
            x = buff_double_1[offset:offset+record_length]
        else:                     # Aligned with buff_double_2
            x = buff_double_2[offset:offset+record_length]
        lst.append(x)             # np.sum is too expensive here
        pos += record_length * 2
    checksum = np.sum(np.concatenate(lst))
    assert(np.abs(checksum) < 1e-6)

The above implementation should be faster, but it is a bit tricky to understand and it is still bound by the latency of Numpy operations. Indeed, the loop still calls Numpy functions for operations like buff_int32[pos] or buff_double_1[offset:offset+record_length]. Even though the overhead of indexing is much smaller than that of the previous functions, it is still quite large for such a critical loop (with ~300_000 iterations)...


Better performance with... a basic pure-Python code

It turns out that the following pure-Python implementation is faster, safer and simpler:

from struct import unpack_from

def read_binary_python_struct(filename):
    checksum = 0.0
    with open(filename, 'rb') as f:
        data = f.read()
        offset = 0
        while offset < len(data):
            record_length = unpack_from('@i', data, offset)[0]
            checksum += sum(unpack_from(f'{record_length}d', data, offset + 4))
            offset += 4 + record_length * 8
    assert(np.abs(checksum) < 1e-6)

This is because the overhead of unpack_from is far lower than that of the Numpy functions, though it is still far from negligible.

In fact, the main issue now is the CPython interpreter itself. It is clearly not designed with high performance in mind. The above code pushes it to its limits. Allocating millions of temporary reference-counted dynamic objects like variable-sized integers and strings is very expensive, and it is not reasonable to let CPython do such an operation.


Writing a high-performance code with Numba

We can drastically speed this up using Numba, which can compile Numpy-based Python code to native code using a just-in-time compiler! Here is an example:

import numba as nb

@nb.njit('float64(uint8[::1])')
def decode_buffer(buff):
    checksum = 0.0
    offset = 0
    while offset + 4 < buff.size:
        record_length = buff[offset:offset+4].view(np.int32)[0]
        start = offset + 4
        end = start + record_length * 8
        if end > buff.size:
            break
        x = buff[start:end].view(np.float64)
        checksum += x.sum()
        offset = end
    return checksum

def read_binary_numba(filename):
    buff = np.fromfile(filename, dtype=np.uint8)
    checksum = decode_buffer(buff)
    assert(np.abs(checksum) < 1e-6)

Numba removes nearly all Numpy overheads thanks to native compiled code. That being said, note that Numba does not implement all Numpy functions yet. This includes np.fromfile, which needs to be called outside a Numba-compiled function.


Benchmark

Here are the performance results on my machine (i5-9600KF with a high-performance Nvme SSD) with Python 3.8.1, Numpy 1.20.3 and Numba 0.54.1.

read_binary_npfromfile:      10616 ms    (   x1)
read_binary_npfrombuffer:     1132 ms    (   x9)
read_binary_faster_numpy:      509 ms    (  x21)
read_binary_python_struct:     222 ms    (  x48)
read_binary_numba:              12 ms    ( x885)
Optimal time:                    7 ms    (x1517)

One can see that the Numba implementation is extremely fast compared to the initial Python implementation and even to the fastest alternative Python implementation. This is especially true considering that 8 ms is spent in np.fromfile and only 4 ms in decode_buffer!
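
That split can be checked by timing the two steps of read_binary_numba separately, for instance with a small sketch like the following (it assumes decode_buffer has already been compiled and that testfile.tmp is the file produced by the generator above):

import time
import numpy as np

t0 = time.perf_counter()
buff = np.fromfile('testfile.tmp', dtype=np.uint8)   # I/O + buffer creation
t1 = time.perf_counter()
checksum = decode_buffer(buff)                       # pure decoding
t2 = time.perf_counter()
print('np.fromfile  : %.1f ms' % ((t1 - t0) * 1e3))
print('decode_buffer: %.1f ms' % ((t2 - t1) * 1e3))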

Jérôme Richard
  • Thanks for analyzing different ways and demonstrating how much faster numba implementation is. For files of 1 GB size and larger I would still have to organize some reading in chunks. – Dmytro Oliinychenko Mar 17 '22 at 22:47
  • "The overhead of these functions can change from one version to another and unless it becomes huge, this is not a bug." Maybe it's not a bug, but a drop in performance from python2 to python3 is rather dramatic. – Dmytro Oliinychenko Mar 17 '22 at 22:56