/EDIT: This is now a Python module: fdict.
I had a similar issue but I was working on nested dicts. Because of the very nested nature of my dataset, all the available solutions would not fit to my use case: shelve
, chest
, shove
, sqlite
, zodb
, and any other NoSQL db work mainly when you have one level, because they all pickle/JSON the values to serialize into the database, so all values after 1st level are serialized and it is not possible to do incremental update of the nested objects (you have to load the whole branch from the 1st-level node, modify, and then put it back), which is not possible if your nested objects are huge by themselves.
This means that these solutions will help when you have lots of 1st-level nodes, but not when you have a few 1st-level nodes and a deep hierarchy.
To counter that, I have made a custom dict class over the native Python dict, to internally represent nested dict in a flattened form: dict()['a']['b']
will internally be represented as dict()['a/b']
.
Then on top of this "internally flattened" dict, you can now use any object to disk solution like shelve
, which is what I used but you can easily replace with something else.
Here is the code:
import shelve
class fdict(dict):
'''Flattened nested dict, all items are settable and gettable through ['item1']['item2'] standard form or ['item1/item2'] internal form.
This allows to replace the internal dict with any on-disk storage system like a shelve's shelf (great for huge nested dicts that cannot fit into memory).
Main limitation: an entry can be both a singleton and a nested fdict, and there is no way to tell what is what, no error will be shown, the singleton will always be returned.
'''
def __init__(self, d=None, rootpath='', delimiter='/', *args):
if d:
self.d = d
else:
self.d = {}
self.rootpath = rootpath
self.delimiter = delimiter
def _buildpath(self, key):
return self.rootpath+self.delimiter+key if self.rootpath else key
def __getitem__(self, key):
# Node or leaf?
if key in self.d: # Leaf: return the value
return self.d.__getitem__(key)
else: # Node: return a new full fdict based on the old one but with a different rootpath to limit the results by default
return fdict(d=self.d, rootpath=self._buildpath(key))
def __setitem__(self, key, value):
self.d.__setitem__(self._buildpath(key), value)
def keys(self):
if not self.rootpath:
return self.d.keys()
else:
pattern = self.rootpath+self.delimiter
lpattern = len(pattern)
return [k[lpattern:] for k in self.d.keys() if k.startswith(pattern)]
def items(self):
# Filter items to keep only the ones below the rootpath level
if not self.rootpath:
return self.d.items()
else:
pattern = self.rootpath+self.delimiter
lpattern = len(pattern)
return [(k[lpattern:], v) for k,v in self.d.items() if k.startswith(pattern)]
def values(self):
if not self.rootpath:
return self.d.values()
else:
pattern = self.rootpath+self.delimiter
lpattern = len(pattern)
return [v for k,v in self.d.items() if k.startswith(pattern)]
def update(self, d2):
return self.d.update(d2.d)
def __repr__(self):
# Filter the items if there is a rootpath and return as a new fdict
if self.rootpath:
return repr(fdict(d=dict(self.items())))
else:
return self.d.__repr__()
def __str__(self):
if self.rootpath:
return str(fdict(d=dict(self.items())))
else:
return self.d.__str__()
class sfdict(fdict):
'''A nested dict with flattened internal representation, combined with shelve to allow for efficient storage and memory allocation of huge nested dictionnaries.
If you change leaf items (eg, list.append), do not forget to sync() to commit changes to disk and empty memory cache because else this class has no way to know if leaf items were changed!
'''
def __init__(self, *args, **kwargs):
if not ('filename' in kwargs):
self.filename = None
else:
self.filename = kwargs['filename']
del kwargs['filename']
fdict.__init__(self, *args, **kwargs)
self.d = shelve.open(filename=self.filename, flag='c', writeback=True)
def __setitem__(self, key, value):
fdict.__setitem__(self, key, value)
self.sync()
def get_filename(self):
return self.filename
def sync(self):
self.d.sync()
def close(self):
self.d.close()
Both fdict
and sfdict
are useable just like standard dict
(but I did not implement all methods heh).
The full code is here:
https://gist.github.com/lrq3000/8ce9174c1c7a5ef546df1e1361417213
This was further developped in a full Python module: fdict.
After benchmarking, this is about 10x slower than a dict when using indirect access (ie, x['a']['b']['c']
), and about as fast when using direct access (ie, x['a/b/c']
), although here I do not account for the overhead of shelve
saving to a anydbm
file, just of fdict
data structure compared to dict
.