I was wondering whether someone might know the answer to the following.
I'm using Python to build a character-based suffix tree. There are over 11 million nodes in the tree which fits in to approximately 3GB of memory. This was down from 7GB by using the slot class method rather than the Dict method.
When I serialise the tree (using the highest protocol) the resulting file is more than a hundred times smaller.
When I load the pickled file back in, it again consumes 3GB of memory. Where does this extra overhead come from, is it something to do with Pythons handling of memory references to class instances?
Update
Thank you larsmans and Gurgeh for your very helpful explanations and advice. I'm using the tree as part of an information retrieval interface over a corpus of texts.
I originally stored the children (max of 30) as a Numpy array, then tried the hardware version (ctypes.py_object*30
), the Python array (ArrayType
), as well as the dictionary and Set types.
Lists seemed to do better (using guppy to profile the memory, and __slots__['variable',...]
), but I'm still trying to squash it down a bit more if I can. The only problem I had with arrays is having to specify their size in advance, which causes a bit of redundancy in terms of nodes with only one child, and I have quite a lot of them. ;-)
After the tree is constructed I intend to convert it to a probabilistic tree with a second pass, but may be I can do this as the tree is constructed. As construction time is not too important in my case, the array.array() sounds like something that would be useful to try, thanks for the tip, really appreciated.
I'll let you know how it goes.