loading arrays saved using numpy.save in append mode

Question

I save arrays using numpy.save() in append mode:

f = open("try.npy", 'ab')
sp.save(f,[1, 2, 3, 4, 5])
sp.save(f,[6, 7, 8, 9, 10])
f.close()

Can I then load the data in LIFO order? That is, if I now want to load the 6-10 array, do I need to call load twice (and use b):

f = open("try.npy", 'rb')
a = sp.load(f)
b = sp.load(f)
f.close()

or can I directly load the second appended save?

Try what? When I just use sp.load then I get the first save I did. Only when using load again can I get the second save I did. So if I want always the last piece I need to follow how many saves I did (instead of just loading the last regardless of how many appends I did) — Eyal Leviatan, Mar 02 '16 at 13:28
You don't need to open the file in `append` mode; that's only needed if you are writing to a file that had previously been written. 'wb' will be ok with 2 saves like this. — hpaulj, Mar 02 '16 at 23:11

hpaulj · Answer 1 · 2016-08-23T16:19:34.337

I'm a little surprised that this sequential save and load works. I don't think it is documented (please correct me). But evidently each save is a self-contained unit, and load reads to the end of that unit, as opposed to the end of the file.
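A minimal sketch of that behaviour, assuming a scratch file in a temporary directory. The helper name `load_all` is made up for illustration. It keeps calling `np.load` on the same open file until the read position reaches the file size, so the last element of the result is the most recent save:

```python
import os
import tempfile

import numpy as np


def load_all(path):
    """Load every array that was written to `path` with consecutive np.save calls."""
    arrays = []
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()          # total file size in bytes
        f.seek(0)
        while f.tell() < size:   # each np.load stops at the end of one saved unit
            arrays.append(np.load(f))
    return arrays


# Recreate the file from the question (path is a throwaway temp file).
path = os.path.join(tempfile.mkdtemp(), "try.npy")
with open(path, "ab") as f:
    np.save(f, [1, 2, 3, 4, 5])
    np.save(f, [6, 7, 8, 9, 10])

arrays = load_all(path)
last = arrays[-1]
```

Comparing `f.tell()` with the file size avoids relying on whichever exception `np.load` raises at end of file, which I believe has differed between NumPy versions.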

Think of each load as a readline. You can't read just the last line of a file; you have to read all the ones before it.

Well - there is a way of reading just the last one - using seek to move the file's read position to a specific point. But to do that you have to know exactly where the desired block starts.

np.savez is the intended way of saving multiple arrays to a file, or rather to a zip archive.
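As a rough illustration of the `np.savez` route, here is a sketch that stores both arrays under names and then loads only the one that is wanted. The keyword names `first` and `second` and the temporary path are arbitrary choices for this example:

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), "try.npz")

# Each keyword becomes the name of one array inside the zip archive.
np.savez(path, first=[1, 2, 3, 4, 5], second=[6, 7, 8, 9, 10])

with np.load(path) as data:
    names = data.files        # names of the stored arrays
    second = data["second"]   # only this member is read from the archive
```

Note that `np.savez` writes the whole archive in one call, so this fits when all arrays are available at save time rather than arriving one by one.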

save writes two parts: a header that contains information like the dtype, the shape and the memory order (fortran_order), and a copy of the array's data buffer. The nbytes attribute gives the size of the data buffer. At least this is the case for numeric and string dtypes.

The save documentation has an example of using an opened file, with seek(0) to rewind the file for use by load.

np.lib.npyio.format (the numpy.lib.format module) has more information on the saving format. It looks like it is possible to determine the length of the header by reading its first few bytes. You could probably use functions in the module to perform all these reads and calculations.
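To make the "first few bytes" concrete, here is a hedged sketch that picks the prefix apart by hand with `struct`, based on my reading of the format description: a 6-byte magic string, two version bytes, then a little-endian header length (2 bytes in version 1.0, 4 bytes in later versions). It saves into an in-memory buffer, so nothing here depends on the file from the question:

```python
import io
import struct

import numpy as np

# Save one int32 array into an in-memory buffer and grab the raw bytes.
buf = io.BytesIO()
np.save(buf, np.arange(1, 6, dtype=np.int32))
raw = buf.getvalue()

magic = raw[:6]                 # should be b"\x93NUMPY"
major, minor = raw[6], raw[7]   # format version bytes

if major == 1:
    (header_len,) = struct.unpack("<H", raw[8:10])   # 2-byte length in version 1.0
    data_start = 10 + header_len
else:
    (header_len,) = struct.unpack("<I", raw[8:12])   # 4-byte length in later versions
    data_start = 12 + header_len

# The header is a padded, newline-terminated dict literal.
header_text = raw[data_start - header_len:data_start].decode("latin1")

# Everything after the header is the raw data buffer.
data = np.frombuffer(raw[data_start:], dtype="<i4")
```

The exact header length depends on the NumPy version, because the padding rule has changed over time, so it is better to read it from the file than to hard-code it.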

If I read the whole file from the example, I get:

In [696]: f.read()
Out[696]: 
b"\x93NUMPY\x01\x00F\x00
{'descr': '<i4', 'fortran_order': False, 'shape': (5,), }\n
 \x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00\x05\x00\x00\x00
\x93NUMPY\x01\x00F\x00
{'descr': '<i4', 'fortran_order': False, 'shape': (5,), }\n
 \x06\x00\x00\x00\x07\x00\x00\x00\x08\x00\x00\x00\t\x00\x00\x00\n\x00\x00\x00"

I added line breaks to highlight the distinct pieces of this file. Notice that each save starts with \x93NUMPY.

With an open file f, I can read the header of the first array with:

In [707]: np.lib.npyio.format.read_magic(f)
Out[707]: (1, 0)
In [708]: np.lib.npyio.format.read_array_header_1_0(f)
Out[708]: ((5,), False, dtype('int32'))

and I can load the data with:

In [722]: np.fromfile(f, dtype=np.int32, count=5)
Out[722]: array([1, 2, 3, 4, 5])

I deduced this from the code of the np.lib.npyio.format.read_array function.

Now the file is positioned at:

In [724]: f.tell()
Out[724]: 100

which is the head of the next array:

In [725]: np.lib.npyio.format.read_magic(f)
Out[725]: (1, 0)
In [726]: np.lib.npyio.format.read_array_header_1_0(f)
Out[726]: ((5,), False, dtype('int32'))
In [727]: np.fromfile(f, dtype=np.int32, count=5)
Out[727]: array([ 6,  7,  8,  9, 10])

and we are at EOF.

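Knowing that int32 takes 4 bytes per element, we can calculate that the data occupies 5 × 4 = 20 bytes; together with the 80-byte header (10 bytes of magic string, version and length field, plus the 70-byte header text) that accounts for the offset of 100. So we could skip over an array by reading the header, calculating the size of the data block, and seeking past it to get to the next array. For small arrays that work isn't worth it; but for very large ones, it may be useful.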
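The skip-ahead idea could be sketched as follows. This is an illustration rather than a tested recipe: `block_offsets` is a hypothetical helper, it only handles format versions 1.0 and 2.0, and it assumes non-object dtypes so that the data size follows from the header. It reads each header with functions from `numpy.lib.format`, seeks past the data without reading it, and records where each block starts:

```python
import math
import os
import tempfile

import numpy as np
from numpy.lib import format as npy_format


def block_offsets(f):
    """Return the byte offset at which each saved array starts (hypothetical helper)."""
    f.seek(0, os.SEEK_END)
    size = f.tell()
    f.seek(0)
    offsets = []
    while f.tell() < size:
        start = f.tell()
        version = npy_format.read_magic(f)
        if version == (1, 0):
            shape, _, dtype = npy_format.read_array_header_1_0(f)
        elif version == (2, 0):
            shape, _, dtype = npy_format.read_array_header_2_0(f)
        else:
            raise ValueError(f"unhandled .npy format version {version}")
        if dtype.hasobject:
            raise ValueError("object arrays are pickled; their size is not in the header")
        # Skip the data block: itemsize * number of elements.
        f.seek(dtype.itemsize * math.prod(shape), os.SEEK_CUR)
        offsets.append(start)
    return offsets


# Recreate the file from the question (path is a throwaway temp file).
path = os.path.join(tempfile.mkdtemp(), "try.npy")
with open(path, "ab") as f:
    np.save(f, [1, 2, 3, 4, 5])
    np.save(f, [6, 7, 8, 9, 10])

with open(path, "rb") as f:
    offsets = block_offsets(f)
    f.seek(offsets[-1])   # jump straight to the last saved block
    last = np.load(f)
```

Only the headers are read while scanning, so the cost grows with the number of saves rather than with the amount of array data. The offsets themselves depend on the default integer dtype and on the header padding of the NumPy version in use.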

interestingly `seek(100)` finds the last part in this example, and `seek(0)` and `seek(200)` find the first. Perhaps `mmap_mode` is the way forward here? — Colin Dickie, Mar 02 '16 at 16:29
`mmap_mode` is intended to read the coherent block of one array; I can't imagine it working across `save` blocks. — hpaulj, Mar 02 '16 at 16:41
I'm a little surprised that `seek(200)` works. In my test it puts the file at the end, so I get an EOF related error if I try further `load` or `read`. — hpaulj, Mar 03 '16 at 02:34
