
I've been using the pickle library to read and write numpy arrays, but the files tend to be very large. While looking for a better way, I found Mark's answer on this page (the one with the chart). Basically, storing the array as a binary file appears to be not only the fastest at reading and writing, but also among the smallest in the amount of space it takes. So I followed his github link and found, on line 96, the code I believe he uses to save the ndarrays. His code is:

class Binary(TimeArrStorage):
    def save(self, arr, pth):
        with open(pth, 'wb+') as fh:
            fh.write(b'{0:s} {1:d} {2:d}\n'.format(arr.dtype, *arr.shape))
            fh.write(arr.data)
            sync(fh)

    def load(self, pth):
        with open(pth, 'rb') as fh:
            dtype, w, h = str(fh.readline()).split()
            return frombuffer(fh.read(), dtype=dtype).reshape((int(w), int(h)))

My specific questions are as follows. First, what is the meaning of the string passed to the first call to fh.write? I assume the preceding "b" means binary, but what about the {0:s} {1:d} {2:d}, especially since there are only two parameters inside the parentheses after format? Second, can this method be used for ndarrays of any data type? Third, do I need to call the sync method (it is defined at the top of the github page)? And last, I looked up what arr.data returns if arr is an ndarray, and it's basically a memory location where the data begins, so how does this code know it has reached the end of the object it's trying to write?

Isaac
  • `.format` is the newer string formatting method (available since Python 2.6 and standard in Python 3). The older method used `%` etc. `np.save/load` should produce about the same size files. They use a fixed size preliminary buffer to store shape and dtype information (something like 80 bytes). – hpaulj Nov 01 '17 at 02:18

1 Answer


I have (from another question)

In [509]: arr
Out[509]: 
array([[-1.0856306 ,  0.99734545],
       [ 0.2829785 , -1.50629471],
       [-0.57860025,  1.65143654]])

I can format a string with its attributes:

In [510]: '%s %d %d'%(arr.dtype, *arr.shape)
Out[510]: 'float64 3 2'

The format in your example gives an error in py3 (it's ok in py2):

In [500]: '{0:s} {1:d} {2:d}'.format(arr.dtype, *arr.shape)
...
TypeError: non-empty format string passed to object.__format__

This is ok:

In [515]: '{0} {1} {2}'.format(arr.dtype, *arr.shape)
Out[515]: 'float64 3 2'

In [533]: '{0!s} {1:d} {2:d}'.format(arr.dtype, *arr.shape)
Out[533]: 'float64 3 2'

For a 2d array, arr.shape is a 2 element tuple, and *arr.shape expands it into two separate arguments. So with a 2d array, format receives 3 arguments: the dtype and the two dimensions.
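To make the argument count concrete, here is a small sketch (the array names are just for illustration). It also shows why a three-field format string only suits 2d arrays: `str.format` ignores surplus positional arguments, so a 3d shape would silently lose its last dimension.

```python
import numpy as np

arr2d = np.ones((3, 2))
arr3d = np.ones((3, 2, 4))

# The star expands the shape tuple into separate positional arguments.
args = (arr2d.dtype, *arr2d.shape)
header2d = '{0} {1} {2}'.format(*args)

# With a 3d array there are 4 arguments, but the format only uses 3 of them.
header3d = '{0} {1} {2}'.format(arr3d.dtype, *arr3d.shape)

print(len(args), header2d)
print(header3d)
```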

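Now that you mention it, arr.data does look funny at first glance. It is not the data itself but a buffer object over the array's memory (a memoryview in Python 3). That object carries its own length, so fh.write(arr.data) writes the whole data buffer content, arr.nbytes bytes in total, rather than a bare address.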
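To check what `fh.write(arr.data)` would actually write, one can inspect the attribute directly. A small sketch for Python 3, assuming a C-contiguous array, where `arr.data` turns out to be a `memoryview` that knows its own size:

```python
import numpy as np

arr = np.ones((3, 2))            # float64, C-contiguous

buf = arr.data                   # buffer object over the array's memory
print(type(buf).__name__)
print(buf.nbytes, arr.nbytes)    # the buffer knows how many bytes it spans
print(bytes(buf) == arr.tobytes())
```

Because the buffer carries its length, a file's `write` can tell where the data ends.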

As I mentioned in the comment, np.save does essentially the same thing, with a slightly larger initial block. If this code is buggy, it would be wiser to stick with the tried-and-proven np.save.
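For comparison, a minimal round trip with `np.save` and `np.load`. The temporary path and the `overhead` variable are only for illustration; the latter measures how many bytes the file uses beyond the raw data:

```python
import os
import tempfile

import numpy as np

arr = np.arange(6, dtype=np.float64).reshape(3, 2)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "arr.npy")
    np.save(path, arr)                 # header (dtype, shape, order) + raw data
    loaded = np.load(path)             # dtype and shape are restored from the header
    overhead = os.path.getsize(path) - arr.nbytes

print(loaded.dtype, loaded.shape, overhead)
```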

Look at np.lib.format (used by np.lib.npyio) to see the full np.save code. It writes a header, and then writes the data buffer with:

array.tofile(fp)

np.load uses np.fromfile if it can, but will fall back to using frombuffer.
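The same header-then-raw-buffer idea can be sketched in Python 3 roughly as follows. This is not the code from the github page or from numpy; `save_raw` and `load_raw` are hypothetical names, and the sketch assumes an array with at least one dimension and a simple dtype whose name contains no spaces (so no structured or object dtypes).

```python
import os
import tempfile

import numpy as np


def save_raw(arr, path):
    """Write a one-line text header (dtype, then shape) followed by the raw bytes."""
    arr = np.ascontiguousarray(arr)    # make sure the bytes are in C order
    header = "{0!s} {1}\n".format(arr.dtype, " ".join(str(n) for n in arr.shape))
    with open(path, "wb") as fh:
        fh.write(header.encode("ascii"))   # bytes has no .format in Python 3
        fh.write(arr.tobytes())


def load_raw(path):
    """Read the header line, then rebuild the array from the remaining bytes."""
    with open(path, "rb") as fh:
        dtype, *shape = fh.readline().decode("ascii").split()
        data = np.frombuffer(fh.read(), dtype=dtype)
    return data.reshape([int(n) for n in shape])


arr = np.arange(24, dtype=np.int32).reshape(2, 3, 4)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "arr.bin")
    save_raw(arr, path)
    restored = load_raw(path)

print(restored.dtype, restored.shape)
```

Joining the shape with spaces lets the header describe any number of dimensions, and the array returned by `frombuffer` on a bytes object is read-only, so copy it if you need to modify it.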


In Py2 this works:

>>> arr=np.ones((2,3))
>>> b'{0:s} {1:d} {2:d}'.format(arr.dtype, *arr.shape)
'float64 2 3'
hpaulj