Python dictionary that only stores changes

Question

I have written a code for design optimization using the Inspyred library and its implementation of Genetic Algorithms. In essence, the optimization process creates a large number of variations on a single data structure, which is a nested dictionary in my case.

In order to reduce the amount of memory used in the process, I have been trying to create some kind of differential dictionary type, which only stores items which are different from the base dictionary. The reason for this is that in a typical case, 95% of the data in the data structure will not be modified in any of the variations, but any part of the data structure can contain variations. So for flexibility reasons, I would like to have a data type that behaves more or less like a dictionary, but only stores the changes.

This is the result of my attempt to create this:

#!/usr/bin/python

import unittest
import copy

global_base={}

class DifferentialDict(object):
    """
    dictionary with differential storage of changes
    all DifferentialDict objects have the same base dictionary
    """

    def __init__(self,base=None):
        global global_base

        self.changes={}

        if not base==None:
            self.set_base(base)

    def set_base(self,base):
        global global_base
        global_base=copy.deepcopy(base)

    def __copy__(self):
        return self

    def __deepcopy__(self):
        new=DifferentialDict()
        new.changes=copy.deepcopy(self.changes)
        return new

    def get(self):
        global global_base
        outdict=copy.deepcopy(global_base)
        for key in self.changes:
            outdict[key]=self.changes[key]
        return outdict

    def __setitem__(self,key,value):
        self.changes[key]=value

    def __getitem__(self,key):
        global global_base
        if key in self.changes:
            return self.changes[key]
        else:
            return global_base[key]

class TestDifferentialDict(unittest.TestCase):
    def test1(self):
        ldict={'a':{1:2,3:4},'b':{'c':[1,2,3],'d':'abc'}}
        ddict=DifferentialDict(base=ldict)

        self.assertEqual(ddict['a'],{1:2,3:4})
        ddict['a']=5
        self.assertEqual(ddict['a'],5)

    def test2(self):
        ldict={'a':{1:2,3:4},'b':{'c':[1,2,3],'d':'abc'}}
        ddict1=DifferentialDict(base=ldict)
        ddict2=DifferentialDict(base=ldict)

        ddict1['a'][3]=5
        ddict2['a'][3]=7
        self.assertEqual(ddict1['a'][3],5)
        self.assertEqual(ddict2['a'][3],7)

    def test3(self):
        ldict={'a':{1:2,3:4},'b':{'c':[1,2,3],'d':'abc'}}
        ddict1=DifferentialDict(base=ldict)
        ddict2=ddict1.__deepcopy__()

        ddict1['a'][3]=5
        ddict2['a'][3]=7
        self.assertEqual(ddict1['a'][3],5)
        self.assertEqual(ddict2['a'][3],7)



if __name__ == "__main__":
    unittest.main()

It works fine for a simple dictionary, but breaks down when new dictionaries are nested within the main dictionary. I understand that this happens because these second level dictionaries are real Python dictionaries instead of instantiations of my DifferentialDict, resulting in an overwriting of entries in global_base rather than changes in self.changes. However, they have to be because of the premise that all DifferentialDict instantiations share the same base dictionary. I could add an 'entry level' key to each DifferentialDict instantiation, but my feeling is that there is a more elegant solution which eludes me.

I would really appreciate any suggestions on how to get my differential dictionary working when nested. Thanks in advance!

I just found this library: [dictdiffer](https://github.com/inveniosoftware/dictdiffer) which might already implement it. On Hacker News there is a discussion about it where raymondh shows how to do this with non-nested dictionaries [in pure Python](https://news.ycombinator.com/item?id=5771696). — Georg Schölly, Mar 01 '16 at 14:10

Georg Schölly · Answer 1 · 2016-03-01T14:00:09.423

3

I don't have the time to try this right now (maybe a bit later), but here are two observations:

combined indexes

If you would use tuples as indexes, for example like this dict[(5,3,2)] you would not have this problem. If you base either your base dict or also the differential dicts on this, you could circumvent the problem.

Maybe you could even write some classes that rewrite dict[a][b][c]to dict[(a,b,c)] to make this internal change transparent.

global base

I don't understand why you use a global base. From my point of view this makes the code more complicated without adding anything. Why don't you just store the base as in:

def MyDict(collections.abc.MutableSequence):
    def __init__(self, base):
        self._base = base

my_global_base = dict()
d = MyDict(my_global_base)

d[2] = 'abc' # modifies self._base inside of the instance too, because it is the
             # same object

If you want to change all the content of the base, just delete all items using popitem() and then add new ones using update(). This way your code is more flexible and does not have any surprising behavior because of the global variables.

abstract base classes

When reimplementing classes like sequences, dictionaries etc. it might come in handy to use the abstract base classes provided by Python, they do some of the implementation work for you.

edited Mar 01 '16 at 14:00

answered Mar 01 '16 at 11:31

Georg Schölly

124,188
49
220
267

Thanks, Georg. As far as combining the keys in a tuple is concerned, I have considered this, but this results in a long key sequence for any element stored in the dictionary, which seems inefficient w.r.t. to the original aim of reducing the memory footprint. – Asellus aquaticus Mar 01 '16 at 12:34
As far as the global base is concerned, the purpose of this is that it is shared by all instances of the DifferentialDict. From your suggestion I understand that a reference to the base object is passed upon instantiation of a new MyDict object, rather than that a local copy within the MyDict object is created. Is this correct? (yes, my knowledge/understanding of this is incomplete). – Asellus aquaticus Mar 01 '16 at 12:38
And I will have a look at the abstract base classes, thanks for the suggestion. – Asellus aquaticus Mar 01 '16 at 12:39
(1) I don't think I understand. Having a dictionary with tuples as keys should be [about the same as using nested dictionaries](http://stackoverflow.com/questions/7147785/). If you're concerned about the memory the dictionary implementation uses, I don't think Python is the right language. Every object (even an integer) has a substantial overhead. – Georg Schölly Mar 01 '16 at 13:52
(2) Correct, everything gets [passed by assignment](https://docs.python.org/3/faq/programming.html#how-do-i-write-a-function-with-output-parameters-call-by-reference) into functions in Python If you do: `a = {'key': 'frog'} ; b = a ; b['key'] = 'cat'`, then also `a['key'] == 'cat'. That works the same if you pass a dict into a function. – Georg Schölly Mar 01 '16 at 13:57
(3) What you are trying to achieve is certainly also possible with nested dictionaries, I just think it's going to be more painful. – Georg Schölly Mar 01 '16 at 13:58
This is my concern: `len(cPickle.dumps({1:{2:{3:4,5:6},7:{8:9,10:11}}}))` gives 63, `len(cPickle.dumps({(1,2,3):4,(1,2,5):6,(1,7,8):9,(1,7,10):11}))` gives 80. Assuming that the cPickle route (http://stackoverflow.com/questions/13530762/how-to-know-bytes-size-of-python-object-like-arrays-and-dictionaries-the-simp) is an acceptable way to approximate memory consumption in this case. – Asellus aquaticus Mar 02 '16 at 07:36
@Asellusaquaticus: You just [nerd sniped](https://xkcd.com/356/) me. I tried this: http://pastebin.com/7abRfXs0. My measurements indicate that dicts with tuple indexes use roughly twice the amount of memory. However, if your innermost values are large, this is negligible. – Georg Schölly Mar 02 '16 at 12:56
I'll put a notch on my gun. You have convinced me to try the flat approach. Your help is appreciated! – Asellus aquaticus Mar 02 '16 at 14:15

Python dictionary that only stores changes

1 Answers1

combined indexes

global base

abstract base classes