I’m reading a binary file using numpy and wondering whether I should use repeated calls to numpy.fromfile or reading from the file manually and calling numpy.frombuffer:

import numpy

# Alternative 1: fromfile
with open(path, 'rb') as f:
    num = numpy.fromfile(f, 'u4', 1)[0]
    l = numpy.fromfile(f, 'u4', num)
    o = numpy.fromfile(f, 'u4', num)
    m = numpy.fromfile(f, 'f4', num)
    c = numpy.fromfile(f, '3f4', num)
    s = numpy.fromfile(f, '3u4', num)

# Alternative 2: read & frombuffer
def fread(f, fmt):
    # Read exactly one item of the given dtype from the file and return it.
    dtype = numpy.dtype(fmt)
    return numpy.frombuffer(f.read(dtype.itemsize), dtype)[0]

with open(path, 'rb') as f:
    num = fread(f, 'u4')
    l = fread(f, f'({num},)u4')
    o = fread(f, f'({num},)u4')
    m = fread(f, f'({num},)f4')
    c = fread(f, f'({num},3)f4')
    s = fread(f, f'({num},3)u4')

Is there a difference (performance or otherwise) between these two methods?

Socob
  • Why don't you just measure the runtime and compare? Either use `time.time` or install `line_profiler` for this. – rocksportrocker Sep 04 '18 at 11:59
  • @rocksportrocker “performance or otherwise” – Socob Sep 04 '18 at 12:35
  • You should probably read and link to the docs for both functions. Those should describe any major differences in behavior. – user2699 Sep 04 '18 at 12:47
  • @user2699 Unfortunately, I have found that many `numpy` functions are rather poorly documented (in that they don’t explain what the function actually *does*, rather giving a small number of examples instead of a specification of behavior), these two included. `numpy.frombuffer`: “*Interpret a buffer as a 1-dimensional array.*” `numpy.fromfile`: “*Construct an array from data in a text or binary file. A highly efficient way of reading binary data with a known data-type, as well as parsing simply formatted text files. Data written using the tofile method can be read using this function.*” – Socob Sep 04 '18 at 13:14
  • @Socob, the docs aren't always the clearest, but they're usually the best place to start. For instance, from the docs for these functions it's clear that `fromfile` will handle text data while `frombuffer` only works with binary data. They also suggest that `frombuffer` doesn't make a copy of the underlying data. – user2699 Sep 04 '18 at 13:38
  • Both functions are compiled, so we can't easily study their code. `fromfile` probably does something like your `fread`. That is, it uses a Python call to read a block of the file, and then uses the same sort of `frombuffer` logic or code to parse it. You might save time by doing one `f.read` to load the whole buffer, and then using `frombuffer` with offsets. But the disk reads are probably already cached at a lower level. – hpaulj Sep 04 '18 at 16:27
  • For larger chunks (num) and standard datatypes I would definitely prefer the first approach. In the case of quite small chunks (where access latency does play a role) or non-standard dtypes, e.g. https://stackoverflow.com/a/45070947/4045774, I would prefer reading a raw data chunk (a few MB) into memory and processing it afterwards (-> a modified version of the second approach). – max9111 Sep 05 '18 at 11:51
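
The single-read variant suggested in the comments (one `f.read`, then `frombuffer` with offsets) could look roughly like the sketch below. It is only an illustration under assumptions: the helper name `parse_buffer`, the file name `demo.bin`, and the sample values are hypothetical, and the layout (a `u4` count followed by the arrays `l`, `o`, `m`, `c`, `s`) is taken from the question's code. It makes no claim about which approach is faster.

```python
import os
import tempfile

import numpy


def parse_buffer(buf):
    """Parse the count-prefixed layout from a single in-memory bytes object.

    Hypothetical helper; the field order follows the question's code.
    """
    num = int(numpy.frombuffer(buf, 'u4', count=1)[0])
    offset = 4  # skip the uint32 count at the start of the buffer
    fields = {}
    # (name, scalar dtype, values per record)
    layout = [('l', 'u4', 1), ('o', 'u4', 1), ('m', 'f4', 1),
              ('c', 'f4', 3), ('s', 'u4', 3)]
    for name, fmt, width in layout:
        arr = numpy.frombuffer(buf, fmt, count=num * width, offset=offset)
        fields[name] = arr.reshape(num, width) if width > 1 else arr
        offset += arr.nbytes
    return fields


# Made-up sample data, written in the same layout the question reads.
num = 3
expected = {
    'l': numpy.array([1, 2, 3], dtype='u4'),
    'o': numpy.array([4, 5, 6], dtype='u4'),
    'm': numpy.array([0.5, 1.5, 2.5], dtype='f4'),
    'c': numpy.arange(9, dtype='f4').reshape(3, 3),
    's': numpy.arange(9, dtype='u4').reshape(3, 3),
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.bin')  # hypothetical file name
    with open(path, 'wb') as f:
        f.write(numpy.array([num], dtype='u4').tobytes())
        for name in ('l', 'o', 'm', 'c', 's'):
            f.write(expected[name].tobytes())
    with open(path, 'rb') as f:
        fields = parse_buffer(f.read())  # one read for the whole file

# Arrays built by frombuffer on a bytes object are read-only views of it;
# call .copy() on a field if it has to be modified afterwards.
writable_l = fields['l'].copy()
```

One difference besides performance shows up here: `frombuffer` on a `bytes` object does not copy the data, so the resulting arrays are read-only, whereas `fromfile` returns a freshly allocated, writable array.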
