
We use a dict containing about 4 GB of data for data processing. It's convenient and fast.

The problem is that this dict might grow beyond 32 GB.

I'm looking for a way to use a dict (just like a regular variable with a get() method, etc.) that can be bigger than the available memory. Ideally this dict would store its data on disk and, when get(key) is called and the value for that key is not in memory, retrieve it from disk.

Preferably, I would like to avoid using an external service, like an SQL database.

I did find shelve, but it seems to need the memory too.

Any ideas on how to approach this problem?

ddofborg
    An SQL database doesn't have to be an "external service" - you can use SQLite. – bruno desthuilliers Dec 19 '13 at 15:51
  • You say you found shelve, but did you try it? I'm pretty sure it does what you're asking for. – bj0 Dec 19 '13 at 16:15
    It seems that `shelve` needs the memory for everything you put in it. It's not like it stores just parts of the data in memory and the rest on disk. – ddofborg Dec 19 '13 at 16:24

5 Answers


That sounds like you could use a key-value store, which is currently hyped under the buzzword NoSQL. A good introduction can be found, for instance, at

http://ayende.com/blog/4449/that-no-sql-thing-key-value-stores.

It is simply a database with the API you described.
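
For instance (a minimal sketch, not from the answer), the stdlib dbm module already gives you an embedded on-disk key-value store with a dict-like interface; the DiskDict class and the 'mydata' path below are illustrative. Values are pickled so arbitrary Python objects can be stored, and only the entries you touch are read from disk:

import dbm
import pickle

class DiskDict:
    '''Dict-like on-disk key-value store backed by the stdlib dbm module.'''
    def __init__(self, path):
        self.db = dbm.open(path, 'c')  # 'c': create the file if it does not exist

    def __setitem__(self, key, value):
        self.db[key] = pickle.dumps(value)  # dbm stores bytes, so pickle the value

    def __getitem__(self, key):
        return pickle.loads(self.db[key])   # raises KeyError if the key is absent

    def get(self, key, default=None):
        try:
            return self[key]
        except KeyError:
            return default

    def close(self):
        self.db.close()

d = DiskDict('mydata')
d['user:1'] = {'name': 'Alice', 'score': 42}
print(d.get('user:1'))  # {'name': 'Alice', 'score': 42}
d.close()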

jcklie

Use the pickle module to serialize the dictionary to disk. Initially, take successive items from the dictionary's iterator and place them into your cache. Then implement a cache scheme such as LRU: evict entries with the dict's `popitem()` method, keeping the most recently accessed items in memory and reloading evicted ones from disk on demand.
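
A minimal sketch of this idea (CachedDict and its parameters are illustrative names, not a standard API): a bounded in-memory LRU cache in front of per-key pickle files on disk.

import os
import pickle
from collections import OrderedDict

class CachedDict:
    '''Bounded LRU cache in front of per-key pickle files on disk
    (keys must be filesystem-safe strings).'''
    def __init__(self, directory, cache_size=1000):
        self.directory = directory
        self.cache_size = cache_size
        self.cache = OrderedDict()  # most recently used keys at the end
        os.makedirs(directory, exist_ok=True)

    def _path(self, key):
        return os.path.join(self.directory, key + '.pkl')

    def __setitem__(self, key, value):
        self.cache[key] = value
        self.cache.move_to_end(key)
        self._evict()

    def __getitem__(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as recently used
            return self.cache[key]
        try:  # cache miss: load the value back from its pickle file
            with open(self._path(key), 'rb') as f:
                value = pickle.load(f)
        except FileNotFoundError:
            raise KeyError(key)
        self.cache[key] = value
        self._evict()
        return value

    def _evict(self):
        # Spill least recently used items to disk until the cache fits
        while len(self.cache) > self.cache_size:
            key, value = self.cache.popitem(last=False)
            with open(self._path(key), 'wb') as f:
                pickle.dump(value, f)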

Ramchandra Apte

I couldn't find any (fast) module to do this, so I decided to create my own (my first Python project; thanks @blubber for some ideas :P). You can find it on GitHub at https://github.com/ddofborg/diskdict. Comments are welcome!

ddofborg
  • Keep in mind that when you are using a dict, its values (list/dict/object) are kept as references. When you pickle the dict, all its values are pickled too, but not as references. That means that if you later update a value, the pickled copy will not be re-pickled and will become outdated. – ddofborg Nov 22 '14 at 20:38

If you don't want to use an SQL database (which is a reasonable solution to a problem like this), you'll have to either figure out a way to compress the data you're working with (sketched below) or use a library like this one (or your own) to do the mapping to disk yourself.

You can also look at this question for some more strategies.
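
As an illustration of the compression idea (a minimal sketch; CompressedDict is an illustrative name, not from the answer): keep zlib-compressed pickles as the dict's values and decompress on access, trading CPU time for memory.

import pickle
import zlib

class CompressedDict:
    '''Dict wrapper that keeps values as zlib-compressed pickles in memory.'''
    def __init__(self):
        self.d = {}

    def __setitem__(self, key, value):
        self.d[key] = zlib.compress(pickle.dumps(value))

    def __getitem__(self, key):
        return pickle.loads(zlib.decompress(self.d[key]))

    def get(self, key, default=None):
        try:
            return self[key]
        except KeyError:
            return default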

maxywb

EDIT: This is now a Python module: fdict.

I had a similar issue, but I was working on nested dicts. Because of the very nested nature of my dataset, none of the available solutions fit my use case: shelve, chest, shove, sqlite, zodb, and any other NoSQL db work mainly when you have one level, because they all pickle/JSON the values to serialize them into the database. So everything below the 1st level is serialized as one blob, and it is not possible to do incremental updates of the nested objects (you have to load the whole branch from the 1st-level node, modify it, and then put it back), which is impractical if your nested objects are huge by themselves.

This means that these solutions will help when you have lots of 1st-level nodes, but not when you have a few 1st-level nodes and a deep hierarchy.

To counter that, I made a custom dict class on top of the native Python dict that internally represents the nested dict in a flattened form: d['a']['b'] is internally stored as d['a/b'].

Then, on top of this "internally flattened" dict, you can use any object-to-disk solution like shelve, which is what I used, but you can easily replace it with something else.

Here is the code:

import shelve

class fdict(dict):
    '''Flattened nested dict: all items are settable and gettable through the standard ['item1']['item2'] form or the internal ['item1/item2'] form.
    This allows you to replace the internal dict with any on-disk storage system, like a shelve shelf (great for huge nested dicts that cannot fit into memory).
    Main limitation: an entry can be both a singleton and a nested fdict; there is no way to tell which is which, no error will be shown, and the singleton will always be returned.
    '''
    def __init__(self, d=None, rootpath='', delimiter='/', *args):
        # Backing store holding the flattened keys: a plain dict by default,
        # but any dict-like object works (eg, a shelve shelf).
        # Compare against None so that an empty backing store is still shared
        # with the sub-fdict views returned by __getitem__.
        if d is not None:
            self.d = d
        else:
            self.d = {}
        self.rootpath = rootpath
        self.delimiter = delimiter

    def _buildpath(self, key):
        return self.rootpath+self.delimiter+key if self.rootpath else key

    def __getitem__(self, key):
        fullpath = self._buildpath(key)
        # Node or leaf?
        if fullpath in self.d: # Leaf: return the value
            return self.d.__getitem__(fullpath)
        else: # Node: return a new fdict view of the same backing store, rooted at this path to limit the results by default
            return fdict(d=self.d, rootpath=fullpath, delimiter=self.delimiter)

    def __setitem__(self, key, value):
        self.d.__setitem__(self._buildpath(key), value)

    def keys(self):
        if not self.rootpath:
            return self.d.keys()
        else:
            pattern = self.rootpath+self.delimiter
            lpattern = len(pattern)
            return [k[lpattern:] for k in self.d.keys() if k.startswith(pattern)]

    def items(self):
        # Filter items to keep only the ones below the rootpath level
        if not self.rootpath:
            return self.d.items()
        else:
            pattern = self.rootpath+self.delimiter
            lpattern = len(pattern)
            return [(k[lpattern:], v) for k,v in self.d.items() if k.startswith(pattern)]

    def values(self):
        if not self.rootpath:
            return self.d.values()
        else:
            pattern = self.rootpath+self.delimiter
            lpattern = len(pattern)
            return [v for k,v in self.d.items() if k.startswith(pattern)]

    def update(self, d2):
        return self.d.update(d2.d)

    def __repr__(self):
        # Filter the items if there is a rootpath and return as a new fdict
        if self.rootpath:
            return repr(fdict(d=dict(self.items())))
        else:
            return self.d.__repr__()

    def __str__(self):
        if self.rootpath:
            return str(fdict(d=dict(self.items())))
        else:
            return self.d.__str__()

class sfdict(fdict):
    '''A nested dict with a flattened internal representation, combined with shelve to allow efficient storage and memory allocation of huge nested dictionaries.
    If you change leaf items in place (eg, list.append), do not forget to sync() to commit the changes to disk and empty the memory cache, because otherwise this class has no way to know that leaf items were changed!
    '''
    def __init__(self, *args, **kwargs):
        # A filename is needed for the on-disk shelf; default to a local file
        # if none was given (the original crashed when filename was omitted).
        self.filename = kwargs.pop('filename', 'sfdict.db')
        fdict.__init__(self, *args, **kwargs)
        # writeback=True keeps accessed entries cached in memory until sync()
        self.d = shelve.open(filename=self.filename, flag='c', writeback=True)

    def __setitem__(self, key, value):
        fdict.__setitem__(self, key, value)
        self.sync()

    def get_filename(self):
        return self.filename

    def sync(self):
        self.d.sync()

    def close(self):
        self.d.close()

Both fdict and sfdict are usable just like a standard dict (but I did not implement all the methods, heh).
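
A quick usage sketch (assuming the classes above are in scope; the filename is just an example):

d = fdict()
d['a']['b'] = 1      # internally stored as d.d['a/b']
print(d['a']['b'])   # 1
print(d['a'])        # {'b': 1}

s = sfdict(filename='mydata.db')  # backed by an on-disk shelve shelf
s['x']['y'] = [1, 2, 3]           # written to the shelf, not a plain dict
s.sync()                          # commit pending changes to disk
s.close()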

The full code is here: https://gist.github.com/lrq3000/8ce9174c1c7a5ef546df1e1361417213

This was further developed into a full Python module: fdict.

After benchmarking, this is about 10x slower than a dict when using indirect access (ie, x['a']['b']['c']), and about as fast when using direct access (ie, x['a/b/c']), although here I do not account for the overhead of shelve saving to an anydbm file, just for the fdict data structure compared to a dict.

gaborous