
Do you know of any Python libraries that are good for storing large unstructured lists (for instance `a = [0, numpy.asarray([1,2,3])]`)?

From what I've seen so far, h5py doesn't support this kind of data, and pickle seems to be slow. Are there any other alternatives?

For my purposes, assume that I am dealing with data of the same type (numpy arrays of int type) but of different shapes.

m4linka
    Did you use cPickle or regular pickle? – user2357112 Aug 21 '13 at 21:24
  • possible duplicate of [best way to preserve numpy arrays on disk](http://stackoverflow.com/questions/9619199/best-way-to-preserve-numpy-arrays-on-disk) – Slater Victoroff Aug 21 '13 at 21:47
  • @SlaterTyranus I don't think it's a duplicate, since he doesn't store just `numpy` arrays. In the example he gave, he also has scalar values and who knows what else. All in all, I think this is a somewhat more general question than just storing `numpy` arrays. – Viktor Kerkez Aug 21 '13 at 22:51
  • @ViktorKerkez In the example yes, but in his actual question he said to assume he was just using numpy arrays with an int type. Either way the question should be reworded, but depending on which question OP is actually asking it may or may not be a duplicate. – Slater Victoroff Aug 21 '13 at 22:59
  • @m4linka: Do you want to store **only** `numpy` arrays? If so, this question is a duplicate. Or do you want to store a list of mixed scalars, `numpy` arrays, and whatnot? – Viktor Kerkez Aug 21 '13 at 23:03
  • Clarification: I see a scalar as a special case of a length-one vector, so I am happy with a=[numpy.zeros(1), numpy.asarray([1,2,3])] (I am not sure about the overhead of using an array to store just a scalar, but I guess it should still be better than pickle). When I try to store 'a' using dset=f.create_dataset('dset',data=a) I get the error 'Object dtype dtype('O') has no native HDF5 equivalent'. I guess it is due to the different shapes in the list. So in short, I would like to store numpy arrays of different sizes but with stored values of the same type. – m4linka Aug 22 '13 at 07:54
  • @user2357112 I've tried cPickle and got 'error return without exception set'. For my convenience I prefer working with Python 2.7. – m4linka Aug 22 '13 at 07:58

2 Answers


If you find pickle and cPickle too slow, you should look into either marshal or shelve, as they are the two other major off-the-shelf serialization modules in the standard library. If those don't work for you, you're going to want to start using a legitimate database.
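
One caveat: marshal only handles core Python types, so it can't serialize numpy arrays directly; shelve, which pickles its values under the hood, is the more direct fit here. A minimal shelve sketch (the file name and key are just illustrative):

import shelve
import numpy as np

a = [0, np.asarray([1, 2, 3])]

# shelve behaves like a persistent dict; values are pickled on assignment.
db = shelve.open('my_data.shelf')
db['a'] = a
db.close()

db = shelve.open('my_data.shelf')
restored = db['a']  # unpickled back into [0, array([1, 2, 3])]
db.close()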

After all, the ability to store and retrieve large amounts of data quickly is basically what a database is, and these serialization modules will only get you so far towards that. If they were perfect, you wouldn't need databases.
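
For the database route, even the sqlite3 module in the standard library can hold numpy arrays if you serialize them to blobs yourself; a rough sketch under that assumption (the table layout and the 'NDARRAY' type name are made up for illustration):

import io
import sqlite3
import numpy as np

def adapt_array(arr):
    # Serialize the array with np.save and hand sqlite the raw bytes.
    buf = io.BytesIO()
    np.save(buf, arr)
    return sqlite3.Binary(buf.getvalue())

def convert_array(blob):
    # Inverse of adapt_array: rebuild the array from the stored bytes.
    return np.load(io.BytesIO(bytes(blob)))

sqlite3.register_adapter(np.ndarray, adapt_array)
sqlite3.register_converter('NDARRAY', convert_array)

conn = sqlite3.connect('arrays.db', detect_types=sqlite3.PARSE_DECLTYPES)
conn.execute('CREATE TABLE items (pos INTEGER PRIMARY KEY, item NDARRAY)')
data = [np.asarray(x) for x in [0, np.asarray([1, 2, 3])]]
conn.executemany('INSERT INTO items VALUES (?, ?)', list(enumerate(data)))
conn.commit()
rows = [row[0] for row in conn.execute('SELECT item FROM items ORDER BY pos')]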

If you don't want to use either of those, there are tools out there built specifically for this purpose, though I suspect they'd be a one-off. You can look here for one such service, but there are a couple more.

Slater Victoroff

Actually, you can store and retrieve this kind of data in an HDF5 file with just a little bit of custom logic:

import tables
import numpy as np

def store(filename, name, data):
    # One group per list, one array node per item; node names record
    # the item's position so the list order can be rebuilt on read.
    # (PyTables 3.x renamed these methods: open_file, create_group, create_array.)
    with tables.openFile(filename, 'w') as store:
        store.createGroup('/', name)
        for i, item in enumerate(data):
            store.createArray('/%s' % name, 'item_%s' % i, item)

def read(filename, name):
    with tables.openFile(filename, 'r') as store:
        nodes = store.listNodes('/%s' % name)  # list_nodes in PyTables 3.x
        data = [0] * len(nodes)
        for node in nodes:
            # Recover the original list position from the node name.
            pos = int(node.name.split('_')[-1])
            data[pos] = node.read()
        return data

Usage:

>>> a = [0, np.array([4,5,6])]
>>> store('my_data.h5', 'a', a)
>>> print read('my_data.h5', 'a')
[0, array([4, 5, 6])]

This is just the first thing that came to mind; I'm sure there is a more efficient pattern for storing a list in an HDF5 file. But let's time it and see whether even this naive implementation is faster than cPickle:

In [7]: a = []
        for i in range(1, 500):
            if i % 10 == 0:
                a.append(i)
            else:
                a.append(np.random.randn(i, i))
In [8]: %%timeit
        store('my_data.h5', 'a', a)
        read_data = read('my_data.h5', 'a')
1 loops, best of 3: 1.32 s per loop
In [9]: %%timeit
        with open('test.pickle', 'wb') as f:
            cPickle.dump(a, f)
        with open('test.pickle', 'rb') as f:
            read_data = cPickle.load(f)
1 loops, best of 3: 1min 58s per loop

Depending on the data, the difference can be bigger or smaller. But even this naive implementation is at least 10x faster than cPickle for any data that contains numpy arrays.
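
The same one-dataset-per-item trick should also work with plain h5py, which the question already mentions; a minimal sketch along the same lines (the function names are just illustrative):

import h5py
import numpy as np

def store_h5(filename, name, data):
    # Mirror of the PyTables version above: one dataset per list item.
    with h5py.File(filename, 'w') as f:
        group = f.create_group(name)
        for i, item in enumerate(data):
            group.create_dataset('item_%s' % i, data=item)

def read_h5(filename, name):
    with h5py.File(filename, 'r') as f:
        group = f[name]
        data = [None] * len(group)
        for key in group:
            # Recover the original list position from the dataset name.
            pos = int(key.split('_')[-1])
            data[pos] = group[key][()]  # [()] reads the dataset in full
        return data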

Viktor Kerkez