I have a large list of images stored as numpy matrices. The images have different sizes e.g.

import numpy as np
from numpy.random import rand

data = [rand(100,200), rand(1024, 768)]

I am looking for a way to store this list of matrices such that it can be read fast (writing the data can be slow). I tried pickle/numpy.savez, but reading the data was slower than loading the raw images again.

I think hdf5 may be fast, however I cannot figure out how to store this list. Not mandatory, but it would be useful to have a data format that allows appending data, so that the whole list does not have to be in memory at once.

Edit: Based on the answers so far, I tried to time some of the suggestions:

data = [rand(1024, 768) for i in np.arange(100)]
def timenp():
    np.savez("test.npz",*data)
    d=np.load('test.npz')
    loaded = [d[f] for f in d]

def timebinary():
    with open("tmp.bin", "wb") as f:
        np.save(f, len(data))
        for img in data:
            np.save(f, img)

    with open("tmp.bin", "rb") as f:
        n = int(np.load(f))
        loaded = []
        for i in np.arange(n):
            loaded.append(np.load(f))

import h5py
def timeh5py():
    with h5py.File('foo.hdf5','w') as f:
        dt = h5py.special_dtype(vlen=np.dtype('float32'))
        dset = f.create_dataset('data', (len(data),), dtype=dt)
        shapes = f.create_dataset('shapes', (len(data), 2), dtype='int32')
        dset[...] = [img.flatten() for img in data]
        shapes[...] = [img.shape for img in data]

    with h5py.File('foo.hdf5','r') as f:
        loaded=[]
        for (img, shape) in zip(f['data'],f['shapes']):
            loaded.append(np.reshape(img,shape))  

python -m cProfile timenp.py
452906 function calls (451141 primitive calls) in 9.256 seconds

python -m cProfile timebinary.py
73085 function calls (71340 primitive calls) in 4.945 seconds

python -m cProfile timeh5py.py
33151 function calls (32568 primitive calls) in 4.384 seconds

Manuel Schmidt
  • How are you loading the raw images? – hpaulj Mar 29 '17 at 16:18
  • `scipy.ndimage.imread` – Manuel Schmidt Mar 29 '17 at 17:06
  • 1
    Assuming `imread` is using compiled code from the PIL library I don't expect `numpy` saves to be any faster. Though that may be depend on the image format (`jpeg` may be require added processing). – hpaulj Mar 29 '17 at 17:14
  • Your second timing is for successive 'save' writes to the same file. I'm not surprised that it is faster, since it doesn't have to go through the `zip` archive mechanism. That multiple saves works has been demonstrated in previous SO questions, but it seems to be an undocumented feature. – hpaulj Mar 29 '17 at 17:19

2 Answers


Try using the numpy `savez` function, which comes in both compressed and uncompressed versions.
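For example, a minimal sketch of both variants (file names here are arbitrary):

```python
import numpy as np
from numpy.random import rand

data = [rand(100, 200), rand(1024, 768)]

# Uncompressed: fast to write and read back, larger on disk.
np.savez("images.npz", *data)

# zlib-compressed: smaller file, but slower, especially for random float data.
np.savez_compressed("images_compressed.npz", *data)

with np.load("images.npz") as d:
    # Keys are auto-generated as arr_0, arr_1, ...
    loaded = [d[k] for k in sorted(d.files)]
```

Note that `sorted` on the auto-generated keys is only correct for fewer than ten arrays; with more, sort numerically or pass explicit keyword names to `savez`.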

Efron Licht
  • I tried it, but it seems that `np.savez("test.npz",data=data)` stores the list using pickle, which is too slow. – Manuel Schmidt Mar 29 '17 at 16:12
  • You might want to try using a fast compression algorithm to cut down on IO time, like [google's snappy compression](https://pypi.python.org/pypi/python-snappy). It looks like someone has asked a question similar to yours about [hd5 saving of numpy arrays](http://stackoverflow.com/questions/20928136/input-and-output-numpy-arrays-to-h5py) – Efron Licht Mar 29 '17 at 16:19
In [276]: alist=[np.arange(10), np.arange(3), np.arange(100)]

If I save this as `np.savez('test', alist)`, it saves the list as one object. If instead I expand the list with `*`, then it puts each list element in a separate file within the archive.

In [277]: np.savez('test',*alist)
In [278]: d=np.load('test.npz')
In [279]: list(d.keys())
Out[279]: ['arr_2', 'arr_1', 'arr_0']
In [280]: d['arr_0']
Out[280]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

With `np.save` (and by extension `savez`), arrays are stored in numpy's own compact format, which consists of a header block with shape and dtype information, followed by a data block that is essentially a byte copy of the array's data buffer. So an `np.save` of an array should be as efficient as any other method.
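You can see that layout directly with numpy's `np.lib.format` helpers (a sketch; the file name is arbitrary, and `read_array_header_1_0` assumes the default version 1.0 header that `np.save` writes for ordinary arrays):

```python
import numpy as np
from numpy.random import rand

arr = rand(4, 5)
np.save("single.npy", arr)

# Peek at the .npy layout: magic bytes, a small header, then the raw buffer.
with open("single.npy", "rb") as f:
    version = np.lib.format.read_magic(f)                       # e.g. (1, 0)
    shape, fortran, dtype = np.lib.format.read_array_header_1_0(f)
    raw = f.read()                                              # byte copy of the data buffer

# The header alone is enough to reconstruct the array from the raw bytes.
restored = np.frombuffer(raw, dtype=dtype).reshape(shape)
```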

If you give `np.save` a non-array object, it will use that object's pickle method. But note that the pickle method for an array is the save format I just described, so a pickle of a single array should still be efficient.

Keep in mind that npz files are lazy-loaded.
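That is, opening the archive only reads the member names; an array is read (and, if compressed, decompressed) on first access. A small sketch:

```python
import numpy as np

np.savez("lazy.npz", *[np.arange(10), np.arange(1000)])

d = np.load("lazy.npz")   # opens the zip archive; no array data is read yet
names = d.files           # just the member names (arr_0, arr_1)
first = d["arr_0"]        # only now is arr_0 actually read into memory
d.close()
```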

With h5py, arrays are saved to named datasets. In a sense it is like the `savez` above - the elements of the list have to have names, whether generated automatically or by your code.
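For instance, a sketch that mirrors the `savez` naming scheme with one dataset per list element (file name arbitrary):

```python
import numpy as np
import h5py

alist = [np.arange(10), np.arange(3), np.arange(100)]

# One named dataset per list element, like savez's arr_0, arr_1, ...
with h5py.File("list.hdf5", "w") as f:
    for i, arr in enumerate(alist):
        f.create_dataset("arr_%d" % i, data=arr)

with h5py.File("list.hdf5", "r") as f:
    # dataset[()] reads the whole dataset back as a numpy array
    loaded = [f["arr_%d" % i][()] for i in range(len(alist))]
```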

I don't know how h5py speeds compare with `save`/`savez`.

h5py can handle arrays that are ragged in one dimension. I've explored that in previous SO questions:

Storing multidimensional variable length array with h5py

How to save list of numpy.arrays of different shape with h5py?

hpaulj