Could someone explain what is happening in this code to save a numpy array as a binary file?

Question

I've been using the pickle library to read and write numpy arrays, but the files tend to be very large. In my quest to find out whether there was a better way, I found Mark's answer on this page (the one with the chart). Basically, storing the array as a binary file appears to be not only the fastest at reading and writing, but also among the smallest in memory use. So I clicked on his GitHub link and found on line 96 the code I believe he uses to save the ndarrays. His code is:

class Binary(TimeArrStorage):
    def save(self, arr, pth):
        with open(pth, 'wb+') as fh:
            fh.write(b'{0:s} {1:d} {2:d}\n'.format(arr.dtype, *arr.shape))
            fh.write(arr.data)
            sync(fh)

    def load(self, pth):
        with open(pth, 'rb') as fh:
            dtype, w, h = str(fh.readline()).split()
            return frombuffer(fh.read(), dtype=dtype).reshape((int(w), int(h)))

My specific questions are: what is the meaning of the string passed to the first call to fh.write? I assume the preceding "b" means binary, but what about the {0:s} {1:d} {2:d}, especially since there are only two parameters inside the parentheses after format? Second, can this method be used for ndarrays of any data type? Third, do I need to call the sync method (it is defined at the top of the GitHub page)? And last, I looked up what arr.data returns if arr is an ndarray, and it's basically a memory location where the data begins, so how does this code know it has reached the end of the object it's trying to write?

`.format` is a string formatting method that was introduced in Python 2.6 and 3.0. The older method used `%` etc. `np.save/load` should produce about the same size files. They store shape and dtype information in a small fixed-layout header block ahead of the data (something like 80 bytes). - hpaulj, Nov 01 '17 at 02:18

hpaulj · Answer 1 · 2017-11-01T05:23:49.020

I have this array (from another question):

In [509]: arr
Out[509]: 
array([[-1.0856306 ,  0.99734545],
       [ 0.2829785 , -1.50629471],
       [-0.57860025,  1.65143654]])

I can format a string with its attributes:

In [510]: '%s %d %d'%(arr.dtype, *arr.shape)
Out[510]: 'float64 3 2'

The format in your example gives an error in py3 (it's ok in py2), because {0:s} passes a non-empty format spec to the dtype object, which doesn't accept one:

In [500]: '{0:s} {1:d} {2:d}'.format(arr.dtype, *arr.shape)
...
TypeError: non-empty format string passed to object.__format__

These are ok, either with no format spec at all or with !s to convert the dtype to a string first:

In [515]: '{0} {1} {2}'.format(arr.dtype, *arr.shape)
Out[515]: 'float64 3 2'

In [533]: '{0!s} {1:d} {2:d}'.format(arr.dtype, *arr.shape)
Out[533]: 'float64 3 2'

For a 2d array, arr.shape is a 2 element tuple, and *arr.shape expands it into separate arguments. So with a 2d array, format receives 3 arguments: the dtype and the two dimensions.
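The unpacking also shows why this header only suits 2d arrays. A small sketch, where describe is a hypothetical helper written for illustration and not part of the code in the question:

```python
import numpy as np

def describe(arr):
    # *arr.shape unpacks the shape tuple into separate positional arguments
    return '{0} {1} {2}'.format(arr.dtype, *arr.shape)

two_d = describe(np.ones((3, 2)))        # dtype + 2 dims = 3 arguments
three_d = describe(np.ones((3, 2, 4)))   # 4 arguments; the unused last one is silently dropped

try:
    describe(np.ones(5))                 # only 2 arguments, nothing to fill {2}
    one_d_error = None
except IndexError as exc:
    one_d_error = exc

# A shape-agnostic alternative: join however many dimensions there are
arr3 = np.ones((3, 2, 4))
general = '{} {}'.format(arr3.dtype, ' '.join(map(str, arr3.shape)))

print(two_d, '|', three_d, '|', general)
```

So a 3d array would lose a dimension in the header without any error, and a 1d array would fail outright.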

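Now that you mention it, arr.data does look funny. Its repr shows an address (something like <memory at 0x...>), but it is not a bare pointer: in Py3 it is a memoryview (a buffer object in Py2) that covers the whole data buffer and knows its own length. So for a contiguous array, fh.write(arr.data) should write all arr.nbytes bytes and then stop.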
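A quick way to see what arr.data holds in Py3 is to inspect it directly. This is only a sketch, with array values chosen for illustration:

```python
import numpy as np

arr = np.arange(6, dtype=np.float64).reshape(3, 2)

buf = arr.data                           # a memoryview in Py3
n_buf, n_arr = buf.nbytes, arr.nbytes    # the view carries its own length

# For this C-contiguous array, these are the bytes a binary write would emit
raw = buf.tobytes()
restored = np.frombuffer(raw, dtype=arr.dtype).reshape(arr.shape)

print(type(buf), n_buf, n_arr)
```

Because the view carries its length, the writer does not need an end marker. The reader, on the other hand, needs the dtype and shape from the header to interpret the bytes.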

As I mentioned in the comment, np.save does essentially the same thing, with a slightly larger initial block. If this code is buggy, it would be wiser to stick with the tried-n-proven np.save.
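To check the size claim yourself, a minimal sketch that saves an array with np.save and compares the file size with the raw data size. The file name arr.npy and the array contents are arbitrary choices for the example:

```python
import os
import tempfile
import numpy as np

arr = np.random.randn(1000, 3)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'arr.npy')
    np.save(path, arr)
    file_size = os.path.getsize(path)
    loaded = np.load(path)

# Whatever the file holds beyond the raw element bytes is the header
header_size = file_size - arr.nbytes
print(arr.nbytes, file_size, header_size)
```

The exact header size depends on the NumPy version, but it should stay small relative to any array large enough to worry about.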

Look at np.lib.format (imported into np.lib.npyio as format) to see the full np.save code. It writes a header, and then writes the data buffer with:

array.tofile(fp)

np.load uses np.fromfile if it is given a real file object, but will fall back to reading the bytes and using frombuffer otherwise.
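The same header-then-tofile pattern can be sketched by hand. This is not the np.save format and not the code from the question; save_raw and load_raw are hypothetical names, and the text header assumes a simple dtype whose name contains no spaces (so no structured or object dtypes):

```python
import os
import tempfile
import numpy as np

def save_raw(arr, path):
    # Text header: dtype name followed by every dimension, then a newline
    header = ' '.join([str(arr.dtype), *map(str, arr.shape)]) + '\n'
    with open(path, 'wb') as fh:
        fh.write(header.encode('ascii'))
        arr.tofile(fh)                        # raw element bytes in C order

def load_raw(path):
    with open(path, 'rb') as fh:
        dtype, *dims = fh.readline().decode('ascii').split()
        data = np.fromfile(fh, dtype=dtype)   # reads from the current position to EOF
    return data.reshape([int(d) for d in dims])

arr = np.arange(24, dtype=np.int32).reshape(2, 3, 4)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'arr.bin')
    save_raw(arr, path)
    restored = load_raw(path)

print(restored.dtype, restored.shape)
```

Unlike the original, this handles any number of dimensions, but it still records nothing about byte order or memory layout, which is part of what the np.save header takes care of.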

In Py2 this works:

>>> arr=np.ones((2,3))
>>> b'{0:s} {1:d} {2:d}'.format(arr.dtype, *arr.shape)
'float64 2 3'
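In Py3 that line fails for a second reason as well: bytes objects have no format method. A minimal sketch of one Py3 equivalent, formatting as str and encoding afterwards, then decoding on the way back in (the ascii encoding is an assumption that holds for plain dtype names):

```python
import numpy as np

arr = np.ones((2, 3))

# bytes has no .format method in Py3, so b'...'.format(...) raises AttributeError
bytes_has_format = hasattr(b'', 'format')

# Format as str, then encode before writing to a file opened in binary mode
header = '{0!s} {1:d} {2:d}\n'.format(arr.dtype, *arr.shape).encode('ascii')

# str() on bytes would give the repr "b'float64 2 3\\n'"; decode() gives the text
dtype, w, h = header.decode('ascii').split()

print(bytes_has_format, header, dtype, w, h)
```

The same decode step applies to the str(fh.readline()) call in the load method, which in Py3 would otherwise leave a b' prefix attached to the dtype name.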
