
I want to do hierarchical key-value storage in Python, which basically boils down to storing dictionaries to files. By that I mean any type of dictionary structure that may contain other dictionaries, numpy arrays, serializable Python objects, and so forth. On top of that, I want it to store numpy arrays in a space-optimized way and to play nicely between Python 2 and 3.

Below are methods I know are out there. My question is: what is missing from this list, and is there an alternative that dodges all my deal-breakers?

  • Python's pickle module (deal-breaker: inflates the size of numpy arrays a lot; see the sketch after this list)
  • Numpy's save/savez/load (deal-breaker: incompatible format across Python 2/3)
  • PyTables replacement for numpy.savez (deal-breaker: only handles numpy arrays)
  • Using PyTables manually (deal-breaker: I want this for constantly changing research code, so it's really convenient to be able to dump dictionaries to files by calling a single function)
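
To make that first deal-breaker concrete, here is a rough sketch of what I mean (exact numbers depend on the array contents; protocol 0 is the Python 2 default):

import pickle
import numpy as np

a = np.random.randn(100000)      # 800,000 bytes of raw float64 data
s = pickle.dumps(a)              # text-based protocol 0, the Python 2 default
print(len(s), a.nbytes)          # the pickle ends up a few times larger than the raw buffer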

The PyTables replacement for numpy.savez is promising, since I like the idea of using HDF5, and it compresses the numpy arrays really efficiently, which is a big plus. However, it does not accept arbitrary dictionary structures.

Lately, what I've been doing is to use something similar to the PyTables replacement, but enhancing it so it can store any type of entry. This actually works pretty well, but I find myself storing primitive data types in length-1 CArrays, which is a bit awkward (and ambiguous with actual length-1 arrays), even though I set the chunk size to 1 so it doesn't take up that much space.
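
For illustration, the pattern is roughly this (a minimal sketch with hypothetical file and node names, not my exact code):

import numpy as np
import tables

with tables.open_file('store.h5', 'w') as h5:
    value = 3.14  # some primitive entry from the dictionary
    atom = tables.Atom.from_dtype(np.dtype(type(value)))
    node = h5.create_carray(h5.root, 'some_key', atom, shape=(1,), chunkshape=(1,))
    node[0] = value  # read it back later with node[0]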

Is there something like that already out there?

Thanks!

Gustav Larsson
  • Have you considered using a NoSQL database system like MongoDB? – Xaranke Aug 06 '13 at 03:31
  • @Xaranke That's a good idea, but I doubt it will offer efficient numpy array storage... or maybe it will? – Gustav Larsson Aug 06 '13 at 04:06
  • You can save a numpy array as a binary object as shown here: http://stackoverflow.com/questions/6367589/saving-numpy-array-in-mongodb – Xaranke Aug 06 '13 at 04:20
  • @Xaranke I saw that, but it relies on Python pickling, so it won't offer any space improvement over just pickling. Of course, I could always try to binarize them in some other way, but that basically puts me back on square one. – Gustav Larsson Aug 06 '13 at 04:38
  • I found this link: https://pypi.python.org/pypi/msgpack-python/. It seems to be a pretty efficient library, used by Redis as well as Pinterest. You may want to take a look. – Prahalad Deshpande Aug 06 '13 at 06:05
  • As far as I know, there is nothing which can automatically dump dictionaries etc. to hdf5. You might look into doing it manually via `h5py`. Other than dictionaries and numpy arrays, what else do you want to store? – Yossarian Aug 06 '13 at 15:56
  • Have you tried `np.memmap`? – Saullo G. P. Castro Aug 07 '13 at 04:30
  • @PrahaladDeshpande Thanks, I tried it, but it does not seem to support numpy arrays. – Gustav Larsson Aug 07 '13 at 14:51
  • @Yossarian This is what I'm leaning towards. Actually, I'm thinking of doing it in parallel for PyTables and h5py, just to see which one turns out best. h5py doesn't support serialized objects without what they refer to in the docs as "a temporary fix," so I'm leaning towards PyTables. – Gustav Larsson Aug 07 '13 at 14:55
  • @GustavLarsson This (https://github.com/proggy/h5obj) claims to do some of what you want, though it didn't actually work for anything I tried. Their idea is to store pickled objects as strings. I am tempted to create a library to do this myself, with the addition of dictionaries of dictionaries. – Yossarian Aug 07 '13 at 15:30

5 Answers


After asking this two years ago, I started coding my own HDF5-based replacement for pickle/np.save. Ever since, it has matured into a stable package, so I thought I would finally answer and accept my own question, because it is by design exactly what I was looking for.

Gustav Larsson

I recently found myself with a similar problem, for which I wrote a couple of functions: one to save the contents of a dict to a group in a PyTables file, and one to load it back into a dict.

They process nested dictionary and group structures recursively, and handle objects with types that are not natively supported by PyTables by pickling them and storing them as string arrays. It's not perfect, but at least things like numpy arrays will be stored efficiently. There's also a check included to avoid inadvertently loading enormous structures into memory when reading the group contents back into a dict.

import tables
import cPickle
import warnings

def dict2group(f, parent, groupname, dictin, force=False, recursive=True):
    """
    Take a dict, shove it into a PyTables HDF5 file as a group. Each item in
    the dict must have a type and shape compatible with PyTables Array.

    If 'force == True', any existing child group of the parent node with the
    same name as the new group will be overwritten.

    If 'recursive == True' (default), new groups will be created recursively
    for any items in the dict that are also dicts.
    """
    try:
        g = f.create_group(parent, groupname)
    except tables.NodeError as ne:
        if force:
            pathstr = parent._v_pathname + '/' + groupname
            f.remove_node(pathstr, recursive=True)
            g = f.create_group(parent, groupname)
        else:
            raise ne
    for key, item in dictin.iteritems():
        if isinstance(item, dict):
            if recursive:
                dict2group(f, g, key, item, recursive=True)
        else:
            if item is None:
                item = '_None'
            f.create_array(g, key, item)
    return g


def group2dict(f, g, recursive=True, warn=True, warn_if_bigger_than_nbytes=100E6):
    """
    Traverse a group, pull the contents of its children and return them as
    a Python dictionary, with the node names as the dictionary keys.

    If 'recursive == True' (default), we will recursively traverse child
    groups and put their children into sub-dictionaries, otherwise sub-
    groups will be skipped.

    Since this might potentially result in huge arrays being loaded into
    system memory, the 'warn' option will prompt the user to confirm before
    loading any individual array that is bigger than some threshold (default
    is 100MB)
    """

    def memtest(child, threshold=warn_if_bigger_than_nbytes):
        mem = child.size_in_memory
        if mem > threshold:
            print '[!] "%s" is %iMB in size [!]' % (child._v_pathname, mem / 1E6)
            confirm = raw_input('Load it anyway? [y/N] >>')
            if confirm.lower() == 'y':
                return True
            else:
                print "Skipping item \"%s\"..." % g._v_pathname
        else:
            return True
    outdict = {}
    for child in g:
        try:
            if isinstance(child, tables.group.Group):
                if recursive:
                    item = group2dict(f, child)
                else:
                    continue
            else:
                if memtest(child):
                    item = child.read()
                    if isinstance(item, str):
                        if item == '_None':
                            item = None
                else:
                    continue
            outdict.update({child._v_name: item})
        except tables.NoSuchNodeError:
            warnings.warn('No such node: "%s", skipping...' % repr(child))
            pass
    return outdict
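
Usage is then something along these lines (a quick sketch with a hypothetical file name):

import numpy as np

f = tables.open_file('store.h5', mode='w')
d = {'a': 1.0, 'b': {'c': np.arange(10)}}
g = dict2group(f, f.root, 'mydict', d)     # write the dict to /mydict
d2 = group2dict(f, g)                      # read it back into a dict
f.close()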

It's also worth mentioning joblib.dump and joblib.load, which tick all of your boxes apart from Python 2/3 cross-compatibility. Under the hood they use np.save for numpy arrays and cPickle for everything else.
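
A quick sketch of that (hypothetical file name):

import numpy as np
import joblib

d = {'weights': np.random.randn(1000, 100), 'label': 'experiment-42'}
joblib.dump(d, 'results.joblib')    # numpy arrays are written via np.save internally
d2 = joblib.load('results.joblib')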

ali_m
  • Thanks, this is very similar to what I started doing, except my catch-all is a CArray of size 1 that I create with `Atom.from_dtype(np.dtype(type(obj)))`. Since PyTables natively supports the numpy `object` type, classes will get that type and just work. Of course PyTables will pickle it under the hood, but it hides that from me, so the code I use is really short. It is ambiguous with legitimate `(1,)`-sized arrays, though that's a minor issue I could address if it becomes a problem. – Gustav Larsson Aug 24 '13 at 03:27
  • Nice. That might be the first practical use I've seen for the `np.object` type. – ali_m Aug 24 '13 at 10:29

I tried playing with np.memmap for saving an array of dictionaries. Say we have the dictionary:

d = {'a': 1, 'b': 2, 'c': [1, 2, 3, {'d': 4}]}

First I tried to save it directly to a memmap:

f = np.memmap('stack.array', dtype=dict, mode='w+', shape=(100,))
f[0] = d
# CRASHES when reopening since it loses the memory pointer

f = np.memmap('stack.array', dtype=object, mode='w+', shape=(100,))
f[0] = d
# CRASHES when reopening for the same reason

The way that worked was to convert the dictionary to a string:

f = np.memmap('stack.array', dtype='|S1000', mode='w+', shape=(100,))
f[0] = str(d)

This works, and afterwards you can call eval(f[0]) to get the value back.

I do not know the advantage of this approach over the others, but it deserves a closer look.

Saullo G. P. Castro
  • Thanks, I appreciate it, but I have to speak frankly about this one and say that it is a terrible idea on many counts: using str/eval as a means of serialization is not reliable at all, and it introduces a major security hole in your program. Large numpy arrays will by default be printed with `...`, which can't be evalled; class instances in general will not be reconstructed; etc. Besides, even if they were, storing everything as its string representation takes much more space than any other method discussed so far. – Gustav Larsson Aug 07 '13 at 14:50

I absolutely recommend a Python object database like ZODB. It seems well suited to your situation, considering that you store objects (literally whatever you like) in a dictionary, which means you can store dictionaries inside dictionaries. I've used it for a wide range of problems, and the nice thing is that you can just hand somebody the database file (the one with a .fs extension). With that, they'll be able to read it in, perform any queries they wish, and modify their own local copies. If you wish to have multiple programs simultaneously accessing the same database, make sure to look at ZEO.

Just a silly example of how to get started:

from ZODB import DB
from ZODB.FileStorage import FileStorage
from ZODB.PersistentMapping import PersistentMapping
import transaction
from persistent import Persistent
from persistent.dict import PersistentDict
from persistent.list import PersistentList

# Defining database type and creating connection.
storage = FileStorage('/path/to/database/zodbname.fs') 
db = DB(storage)
connection = db.open()
root = connection.root()

# Define and populate the structure.
root['Vehicle'] = PersistentDict() # Upper-most dictionary
root['Vehicle']['Tesla Model S'] = PersistentDict() # Object 1 - also a dictionary
root['Vehicle']['Tesla Model S']['range'] = "208 miles"
root['Vehicle']['Tesla Model S']['acceleration'] = 5.9
root['Vehicle']['Tesla Model S']['base_price'] = "$71,070"
root['Vehicle']['Tesla Model S']['battery_options'] = ["60kWh","85kWh","85kWh Performance"]
# more attributes here

root['Vehicle']['Mercedes-Benz SLS AMG E-Cell'] = PersistentDict() # Object 2 - also a dictionary
# more attributes here

# add as many objects with as many characteristics as you like.

# committing changes; up until this point things can be rolled back
transaction.get().commit()
# (use transaction.get().abort() instead of commit() to discard uncommitted changes)
connection.close()
db.close()
storage.close()

Once the database is created, it's very easy to use. Since it's an object database (a dictionary), you can access objects very easily:

#after it's opened (lines from the very beginning, up to and including root = connection.root() )
>> root['Vehicle']['Tesla Model S']['range']
'208 miles'

You can also display all of the keys (and do all other standard dictionary things you might want to do):

>> root['Vehicle']['Tesla Model S'].keys()
['acceleration', 'range', 'battery_options', 'base_price']

The last thing I want to mention is that keys can be changed (see: Changing the key value in python dictionary). Values can also be changed, so if your research results change because you change your method or something, you don't have to start the entire database from scratch (especially if everything else is still okay). Be careful when doing both of these. I put safety measures in my database code to make sure I'm aware of any attempts to overwrite keys or values.
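
For example, renaming a key could look like this (a minimal sketch; 'price' is just a hypothetical new key name):

tesla = root['Vehicle']['Tesla Model S']
tesla['price'] = tesla['base_price']  # store the value under the new key
del tesla['base_price']               # remove the old key
transaction.get().commit()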

** ADDED **

# added imports
import numpy as np
from tempfile import TemporaryFile
outfile = TemporaryFile()

# insert into definition/population section
np.save(outfile,np.linspace(-1,1,10000))
root['Vehicle']['Tesla Model S']['arraydata'] = outfile

# check to see if it worked
>>> root['Vehicle']['Tesla Model S']['arraydata']
<open file '<fdopen>', mode 'w+b' at 0x2693db0>

outfile.seek(0)# simulate closing and re-opening
A = np.load(root['Vehicle']['Tesla Model S']['arraydata'])

>>> A
array([-1.        , -0.99979998, -0.99959996, ...,  0.99959996,
        0.99979998,  1.        ])

You could also use numpy.savez() to store multiple numpy arrays in a single file in this exact same way (or numpy.savez_compressed() if you want them compressed).
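
A minimal sketch of that, following the same pattern as above (the key names are hypothetical):

arrays_file = TemporaryFile()
np.savez_compressed(arrays_file, grid=np.linspace(-1, 1, 10000), accel=np.array([5.9]))
root['Vehicle']['Tesla Model S']['more_arrays'] = arrays_file

arrays_file.seek(0)  # simulate closing and re-opening
data = np.load(root['Vehicle']['Tesla Model S']['more_arrays'])
print(data['grid'].shape)  # (10000,)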

astromax
  • This seems to nicely store dictionaries. However, I don't see much about storing numpy arrays. Since many of my dictionary tree leaves are numpy arrays, it's important that these are optimally stored. Pickling inflates the size, numpy.save stores the binary data as is, while PyTables can compress low entropy data. If ZODB doesn't have explicit support, it will at best fall back to pickling the data, which is not up to snuff I'm afraid. – Gustav Larsson Aug 24 '13 at 03:36
  • Two things: 1) Why do you think it's unable to store numpy arrays? I could easily have stored a numpy array as the value of one of my keys. 2) I just tested numpy.save() and it seems to work fine in my version of Python (3.3.1). Are you saying the syntax is different? If so, there are tools (http://docs.python.org/2/library/2to3.html) to convert between the two. If you still have problems, why don't you just check the version of Python it's defaulting to (import platform; platform.python_version()) and adjust the syntax in a conditional statement? – astromax Aug 24 '13 at 16:53
  • You could even store a file as the value of a key, allowing you to compress your numpy array using numpy.save(). – astromax Aug 24 '13 at 17:02
  • 1) I couldn't find any mention of it in their documentation, which probably means it will pickle the arrays at best, which is a deal-breaker. 2) I use numpy.load/save a lot and you can even save dictionaries directly, exactly to the specification that I want. The problem is that a file saved in Python 2 cannot be opened in Python 3 and vice versa. This is not a syntax problem, but because numpy.load/save relies on standard-library code for writing byte arrays, which differs between the Pythons. – Gustav Larsson Aug 24 '13 at 17:51
  • Hmm, I think I'm starting to see your dilemma. Even if you used this object database to store files, you still couldn't open between 2.x and 3.x versions. Like you mentioned in your first post, if only there was a function that played nicely with both 2 and 3. If such a tool existed you could use ZODB to organize your dictionaries the way you wanted (or save your dictionary out to file directly). I'm going to keep my eye out for something like this. – astromax Aug 24 '13 at 18:40

This is not a direct answer, but you may also be interested in JSON. Have a look at 13.10. Serializing Datatypes Unsupported by JSON. It shows how to extend the format for unsupported types.
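
As a rough illustration of the idea (a minimal sketch; the encoder class name is made up), you can subclass json.JSONEncoder so that numpy values become JSON-friendly lists and scalars:

import json
import numpy as np

class NumpyJSONEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, np.ndarray):
            return obj.tolist()       # arrays become (nested) lists
        if isinstance(obj, np.generic):
            return obj.item()         # numpy scalars become plain Python scalars
        return json.JSONEncoder.default(self, obj)

d = {'a': 1, 'weights': np.arange(5, dtype=np.float32)}
text = json.dumps(d, cls=NumpyJSONEncoder)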

The whole chapter from "Dive Into Python 3" by Mark Pilgrim is definitely a good read, if only to know what is possible...

Update: Possibly an unrelated idea, but... I have read somewhere that one of the reasons why XML was eventually adopted for data exchange in heterogeneous environments was a study that compared a specialized binary format with zipped XML. The conclusion for you could be to use a possibly less space-efficient solution and compress it via zip or another well-known algorithm. Using a known algorithm helps when you need to debug (you can unzip and then look at the text file by eye).
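
In that spirit, a minimal sketch of compressing the JSON text (the file name is made up):

import gzip
import json

with gzip.open('data.json.gz', 'wb') as fh:
    fh.write(json.dumps({'a': 1, 'b': [1, 2, 3]}).encode('utf-8'))

with gzip.open('data.json.gz', 'rb') as fh:
    restored = json.loads(fh.read().decode('utf-8'))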

pepr
  • I think this is a great solution (on par with ZODB and the like) if the main concern is to save dictionary structures with simple data. However, my focus is on cases where the leaves are often very large numpy arrays, which would completely ruin the human readability of the files with big chunks of encoded binary data. Without the human-readable aspect, I'm afraid a JSON/XML solution isn't up to snuff in terms of optimized storage space, despite added compression (at least compared to PyTables). – Gustav Larsson Aug 24 '13 at 21:50