
The use case: a Python class stores large numpy arrays (large, but small enough to work with comfortably in memory) in a useful structure. Here's a cartoon of the situation:

main class: Environment; stores useful information pertinent to all balls

"child" class: Ball; stores information pertinent to this particular ball

Environment member variable: balls_in_environment (list of Balls)

Ball member variable: large_numpy_array (NxN numpy array that is large, but still easy to work with in-memory)

I would prefer to persist Environment as a whole.

Some options:

  • pickle: too slow, and it produces output that takes up a LOT of space on the hard drive

  • database: too much work; I could store the important information from the class in a database (which requires writing functions to extract that information and put it into the DB) and later rebuild the class by creating a new instance and refilling it with data from the DB (which requires writing functions to do the rebuilding)

  • JSON: I am not very familiar with JSON, but Python has a standard-library module for it, and it is the solution recommended by this article. I don't see how JSON would be more compact than pickle, though; more importantly, it doesn't deal nicely with numpy

  • MessagePack: another package recommended by the same article; however, I have never heard of it, and I don't want to strike out into the unknown for what seems to be a standard problem

  • numpy.save + something else: store the numpy arrays associated with each Ball, using numpy.save functionality, and store the non-numpy stuff separately somehow (tedious)?
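
For reference, the last option could look roughly like the sketch below. It assumes Python 3 and uses `np.savez_compressed` for the arrays plus a JSON file for everything else. The `label` and `temperature` attributes, the file names, and the `save_environment`/`load_environment` helpers are hypothetical and exist only for illustration.

```python
import json
import os
import tempfile

import numpy as np


class Ball:
    def __init__(self, label, large_numpy_array):
        self.label = label  # hypothetical non-array attribute
        self.large_numpy_array = large_numpy_array


class Environment:
    def __init__(self, temperature, balls_in_environment):
        self.temperature = temperature  # hypothetical shared attribute
        self.balls_in_environment = balls_in_environment


def save_environment(env, directory):
    """Write all arrays to arrays.npz and everything else to meta.json."""
    arrays = {"ball_%d" % i: ball.large_numpy_array
              for i, ball in enumerate(env.balls_in_environment)}
    np.savez_compressed(os.path.join(directory, "arrays.npz"), **arrays)
    meta = {"temperature": env.temperature,
            "labels": [ball.label for ball in env.balls_in_environment]}
    with open(os.path.join(directory, "meta.json"), "w") as f:
        json.dump(meta, f)


def load_environment(directory):
    """Rebuild an Environment from the two files written above."""
    with open(os.path.join(directory, "meta.json")) as f:
        meta = json.load(f)
    with np.load(os.path.join(directory, "arrays.npz")) as data:
        balls = [Ball(label, data["ball_%d" % i])
                 for i, label in enumerate(meta["labels"])]
    return Environment(meta["temperature"], balls)


env = Environment(21.5, [Ball("a", np.random.randn(50, 50)),
                         Ball("b", np.random.randn(50, 50))])
with tempfile.TemporaryDirectory() as tmp:
    save_environment(env, tmp)
    restored = load_environment(tmp)
```

This shows where the tedium comes from: every attribute has to be listed by hand in both the save and the load function.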

What is the best option for my use case?

bzm3r
  • Quick check: if you're on Python 2, did you try `cPickle`? And did you set the protocol version? – user2357112 Nov 16 '15 at 18:58
  • [hdf5](http://www.h5py.org/) is probably your best bet. It's what pandas uses for quick IO of large datasets. – Adam Acosta Nov 16 '15 at 18:59
  • `np.save` resorts to pickle for variables (and elements of arrays) that it can't save as normal arrays. `savez` saves multiple arrays, one per file, in a `zip` archive (compressed or not). – hpaulj Nov 16 '15 at 19:01
  • It should be relatively painless to use HDF5 to serialize arbitrary Python classes with numpy arrays as members (see [here](http://stackoverflow.com/q/18071075/1461210) for an example using dicts) – ali_m Nov 16 '15 at 19:14
  • Another good option would be to use [`joblib.dump`](https://pythonhosted.org/joblib/generated/joblib.dump.html#joblib.dump), which internally uses `np.save` for numpy arrays and `cPickle` for everything else. – ali_m Nov 16 '15 at 19:20
  • @user2357112 I haven't tried `cPickle`; I'll look into it along with the protocol version options! – bzm3r Nov 16 '15 at 23:04
  • @AdamAcosta I don't see how that's my best bet, since (and I might be misunderstanding, so please correct me) it's geared particularly towards storing VERY large numerical arrays, rather than Python objects such as classes? I don't have very large numerical arrays, and I prefer to work with the arrays in-memory. – bzm3r Nov 16 '15 at 23:06
  • @ali_m So, if a Python class has `numpy` arrays, does `joblib` handle the hassle of picking apart the `numpy` array for saving using `np.save`, and then re-integrating those member variables back in during load time? – bzm3r Nov 16 '15 at 23:10

1 Answer


As I mentioned in the comments, joblib.dump might be a good option. It uses np.save to efficiently store numpy arrays, and cPickle for everything else:

import numpy as np
import cPickle
import joblib
import os


class SerializationTest(object):
    def __init__(self):
        self.array = np.random.randn(1000, 1000)

st = SerializationTest()
fnames = ['cpickle.pkl', 'numpy_save.npy', 'joblib.pkl']

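# using cPickle (default protocol 0, an ASCII format; open the file in binary mode)
with open(fnames[0], 'wb') as f:
    cPickle.dump(st, f)

# using np.save (st is not an array, so numpy falls back to pickling it)
np.save(fnames[1], st)

# using joblib.dump (without compression)
joblib.dump(st, fnames[2])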

# check file sizes
for fname in fnames:
    print('%15s: %8.2f KB' % (fname, os.stat(fname).st_size / 1E3))
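#     cpickle.pkl: 23695.56 KB
#  numpy_save.npy:  8000.33 KB
#      joblib.pkl:     0.18 KB
# Note: joblib.pkl is this small most likely because joblib versions of that
# era wrote the array to a companion file (joblib.pkl_01.npy) next to the main
# pickle, so compare total disk usage rather than this one file.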
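
To address the follow-up question in the comments: as far as I know, `joblib.load` reassembles the object with its arrays in place, so nothing has to be picked apart by hand. Below is a minimal round-trip sketch, assuming Python 3 and a reasonably recent joblib. The `Environment` and `Ball` classes are stand-ins for the ones in the question, and the file name is arbitrary.

```python
import os
import tempfile

import joblib
import numpy as np


class Ball(object):
    def __init__(self, n):
        self.large_numpy_array = np.random.randn(n, n)


class Environment(object):
    def __init__(self, num_balls, n):
        self.balls_in_environment = [Ball(n) for _ in range(num_balls)]


env = Environment(num_balls=3, n=100)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "environment.joblib")
    # compress is optional; 0 disables it, higher values trade speed for size
    joblib.dump(env, path, compress=3)
    restored = joblib.load(path)

# every Ball comes back with its array attached
for original, loaded in zip(env.balls_in_environment,
                            restored.balls_in_environment):
    assert np.array_equal(original.large_numpy_array,
                          loaded.large_numpy_array)
```

As with plain pickle, the class definitions must be importable when loading, and files from untrusted sources should not be loaded.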

One potential downside is that because joblib.dump uses cPickle to serialize Python objects, the resulting files are not portable from Python 2 to 3. For better portability you could look into using HDF5, e.g. here.
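
A rough sketch of the HDF5 route using `h5py` follows, assuming Python 3. It stores each array as a dataset in a `balls` group and scalar metadata as file attributes. The `save_balls`/`load_balls` helpers, the group and dataset names, and the `temperature` attribute are all hypothetical, and rebuilding the actual class instances from the loaded data is left to the caller.

```python
import os
import tempfile

import h5py
import numpy as np


def save_balls(path, arrays, temperature):
    """Store one dataset per array plus a scalar attribute."""
    with h5py.File(path, "w") as f:
        f.attrs["temperature"] = temperature  # hypothetical metadata
        balls = f.create_group("balls")
        for i, arr in enumerate(arrays):
            balls.create_dataset("ball_%d" % i, data=arr, compression="gzip")


def load_balls(path):
    """Read every dataset back into memory as a numpy array."""
    with h5py.File(path, "r") as f:
        temperature = float(f.attrs["temperature"])
        balls = f["balls"]
        arrays = [balls["ball_%d" % i][()] for i in range(len(balls))]
    return arrays, temperature


arrays = [np.random.randn(50, 50) for _ in range(3)]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "environment.h5")
    save_balls(path, arrays, 21.5)
    loaded_arrays, loaded_temperature = load_balls(path)
```

Because no pickling is involved, the file does not depend on the Python version or on the class definitions, at the cost of writing the mapping between object attributes and datasets yourself.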

ali_m