
I use numpy.save and numpy.load to read and write large datasets in my project. I realized that numpy.save does not support an append mode. For instance (Python 3):

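import numpy as np

n = 5
dim = 5
for _ in range(3):
    Matrix = np.random.choice(np.arange(10, 40, dim), size=(n, dim))
    # Each call overwrites myfile.npy, so only the last Matrix survives.
    np.save('myfile', Matrix)

# The file holds a single 5x5 array, so [1:7] returns only rows 1 to 4.
M1 = np.load('myfile.npy', mmap_mode='r')[1:7].copy()
print(M1)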

Loading a specific portion of the data with the slice [1:7] does not give the expected result, because np.save overwrites the file instead of appending to it. I found this answer, but it looks strange: it uses file(filename, 'a'), and I don't know what file is. Is there a clever workaround to achieve this without using additional lists?

Justin
  • The `a` and `file` there mean: open a file in append mode – Zhiya Mar 27 '18 at 20:42
  • *"what is `file`?"* It is a [Python 2 builtin function](https://docs.python.org/2/library/functions.html#file). It was removed in Python 3. – Warren Weckesser Mar 27 '18 at 20:43
  • @Zhiya Yes, but it will complain like this `write() argument must be str, not bytes` which is because of `Matrix` – Justin Mar 27 '18 at 20:44
  • @Medo Can we see your code and error ? – Zhiya Mar 27 '18 at 20:46
  • `np.load('myfile.npy', mmap_mode='r')[1:7]` wouldn't have worked anyway. The `npy` file format doesn't work that way. – user2357112 Mar 27 '18 at 20:46
  • @WarrenWeckesser Thank you. I just realized that!! I started learning Python from Python 3 – Justin Mar 27 '18 at 20:46
  • @user2357112 but it is working in non-append mode. I use this load to save memory (not loading the entire file). It was suggested by this answer https://stackoverflow.com/questions/49518962/how-slicing-numpy-load-file-is-loaded-into-memory/49519121#49519121 – Justin Mar 27 '18 at 20:49
  • @Medo: I know. I wrote that answer. What I mean is that even if you got `np.save` to append the data, the resulting file contents wouldn't be a valid npy file representing a combined array. – user2357112 Mar 27 '18 at 20:51

3 Answers


The npy file format doesn't work that way. An npy file encodes a single array, with a header specifying shape, dtype, and other metadata. You can see the npy file format spec in the NumPy docs.
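To see that header for yourself, here is a minimal sketch that reads it back with the helpers in `np.lib.format`. The scratch path is hypothetical, and the sketch assumes the file was written as format version 1.0, which is what `np.save` normally picks for a small plain array like this one.

```python
import os
import tempfile

import numpy as np

# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "example.npy")
np.save(path, np.zeros((5, 5), dtype=np.int64))

with open(path, "rb") as f:
    # Magic string plus format version, e.g. (1, 0).
    version = np.lib.format.read_magic(f)
    # Assumes a version 1.0 header (see the note above).
    shape, fortran_order, dtype = np.lib.format.read_array_header_1_0(f)
    # The raw array bytes start right after the header.
    data_offset = f.tell()

print(version, shape, fortran_order, dtype, data_offset)
```

The shape stored in the header is fixed when the file is written, which is why adding rows afterwards means the header has to change as well.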

Support for appending data was not a design goal of the npy format. Even if you managed to get numpy.save to append to an existing file instead of overwriting the contents, the result wouldn't be a valid npy file. Producing a valid npy file with additional data would require rewriting the header, and since this could require resizing the header, it could shift the data and require the whole file to be rewritten.

NumPy comes with no tools to append data to existing npy files, beyond reading the data into memory, building a new array, and writing the new array to a file. If you want to save more data, consider writing a new file, or pick a different file format.
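The read, rebuild, rewrite route described above could look roughly like this. It is only a sketch: the path and the example arrays are made up for illustration, and it loads the whole existing array into memory, so it does not scale to files larger than RAM.

```python
import os
import tempfile

import numpy as np

# Hypothetical scratch file holding an existing 2x5 array.
path = os.path.join(tempfile.mkdtemp(), "myfile.npy")
np.save(path, np.arange(10).reshape(2, 5))

# New rows we would like to "append".
new_rows = np.arange(10, 25).reshape(3, 5)

# Load everything, build one bigger array, and overwrite the file.
combined = np.concatenate([np.load(path), new_rows], axis=0)
np.save(path, combined)

# The result is an ordinary npy file, so mmap_mode and slicing work.
result = np.load(path, mmap_mode="r")
print(result[1:7].shape)
```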

user2357112

In Python 3, repeatedly saving to and loading from the same open file works:

In [113]: f = open('test.npy', 'wb')
In [114]: np.save(f, np.arange(10))
In [115]: np.save(f, np.zeros(10))
In [116]: np.save(f, np.ones(10))
In [117]: f.close()
In [118]: f = open('test.npy', 'rb')
In [119]: for _ in range(3):
     ...:     print(np.load(f))
     ...:     
[0 1 2 3 4 5 6 7 8 9]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
In [120]: np.load(f)
OSError: Failed to interpret file <_io.BufferedReader name='test.npy'> as a pickle

Each save writes a self-contained block of data to the file, consisting of a header block and an image of the data buffer. The header block records the shape and dtype, which together determine the length of the data buffer.

Each load reads the defined header block, and the known number of data bytes.

As far as I know this is not documented, but has been demonstrated in previous SO questions. It is also evident from the save and load code.

Note that these are separate arrays, both on saving and on loading. But we can combine the loaded arrays into one array if their dimensions are compatible.

In [122]: f = open('test.npy', 'rb')
In [123]: np.stack([np.load(f) for _ in range(3)])
Out[123]: 
array([[0., 1., 2., 3., 4., 5., 6., 7., 8., 9.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])
In [124]: f.close()
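If you don't know in advance how many arrays were saved, one option is to keep loading until the read position reaches the end of the file. The helper name `load_all` and the scratch path below are my own, not part of NumPy, and this relies on the same undocumented block-after-block behavior shown above.

```python
import os
import tempfile

import numpy as np

# Hypothetical scratch file with three blocks written back to back.
path = os.path.join(tempfile.mkdtemp(), "blocks.npy")
with open(path, "wb") as f:
    for i in range(3):
        np.save(f, np.full((2, 5), i))


def load_all(filename):
    """Load every array stored back to back in filename (illustrative helper)."""
    size = os.path.getsize(filename)
    arrays = []
    with open(filename, "rb") as f:
        # Each np.load consumes exactly one header plus its data.
        while f.tell() < size:
            arrays.append(np.load(f))
    return arrays


blocks = load_all(path)
stacked = np.concatenate(blocks, axis=0)
print(len(blocks), stacked.shape)
```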

Append multiple numpy files to one big numpy file in python

loading arrays saved using numpy.save in append mode

hpaulj
  • It's worth noting that this is incompatible with `mmap_mode`, which the questioner is using. Aside from that, I'd like to say that calling such a file `anything.npy` is misleading when the contents are really an ad-hoc format based on NPY, and that I'd recommend a different form of data storage. For example, the `npz` format, which is just a zip file with `npy` files inside, which you can manipulate and add data to with the `zipfile` standard library module. – user2357112 Mar 27 '18 at 23:58
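A rough sketch of the `npz` idea from the comment above: serialize each array to npy bytes in memory, then add it as a new member of the zip archive with `zipfile` in append mode. The helper `append_to_npz`, the member names `chunk0`, `chunk1`, ... and the scratch path are all assumptions made for this example.

```python
import io
import os
import tempfile
import zipfile

import numpy as np


def append_to_npz(filename, name, arr):
    """Add arr to the archive as member '<name>.npy' (illustrative helper)."""
    buf = io.BytesIO()
    np.save(buf, arr)  # npy bytes for this one array
    # Mode "a" creates the archive if needed, otherwise adds a member.
    with zipfile.ZipFile(filename, mode="a") as zf:
        zf.writestr(name + ".npy", buf.getvalue())


# Hypothetical scratch archive.
path = os.path.join(tempfile.mkdtemp(), "myfile.npz")
for i in range(3):
    append_to_npz(path, "chunk%d" % i, np.full((5, 5), i))

# np.load recognizes the zip container and exposes members by name.
with np.load(path) as npz:
    names = list(npz.files)
    combined = np.concatenate([npz[n] for n in names], axis=0)

print(names, combined.shape)
```

Each member is still a separate array, so slicing across chunks requires loading the chunks involved; only the bookkeeping gets simpler.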

The file function was removed in Python 3. Though I won't guarantee that it works, the Python 3 code equivalent to the code in the link in your question would be

with open('myfile.npy', 'ab') as f_handle:
    np.save(f_handle, Matrix)

This should then append Matrix to 'myfile.npy' as a separate block after the existing contents.

jmd_dk
  • Thank you very much. I tried your suggestion, but because `Matrix` is a NumPy array, I get this error: `TypeError: write() argument must be str, not bytes` – Justin Mar 27 '18 at 20:53
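For what it's worth, that `TypeError` is what I would expect when the file is opened in text mode (`'a'`) rather than binary mode (`'ab'`), since `np.save` writes bytes. A small sketch to compare the two, using hypothetical scratch paths and a stand-in `Matrix`:

```python
import os
import tempfile

import numpy as np

tmpdir = tempfile.mkdtemp()
Matrix = np.arange(25).reshape(5, 5)  # stand-in for the question's Matrix

# Text mode: the handle only accepts str, so writing bytes fails.
text_path = os.path.join(tmpdir, "text_mode.npy")
try:
    with open(text_path, "a") as f_handle:
        np.save(f_handle, Matrix)
    text_mode_error = None
except TypeError as exc:
    text_mode_error = str(exc)

# Binary mode: the handle accepts bytes, so the save goes through.
binary_path = os.path.join(tmpdir, "binary_mode.npy")
with open(binary_path, "ab") as f_handle:
    np.save(f_handle, Matrix)

print(text_mode_error)
print(np.load(binary_path).shape)
```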