Python memory serialisation

Question

I was wondering whether someone might know the answer to the following.

I'm using Python to build a character-based suffix tree. There are over 11 million nodes in the tree which fits in to approximately 3GB of memory. This was down from 7GB by using the slot class method rather than the Dict method.

When I serialise the tree (using the highest protocol) the resulting file is more than a hundred times smaller.

When I load the pickled file back in, it again consumes 3GB of memory. Where does this extra overhead come from, is it something to do with Pythons handling of memory references to class instances?

Update

Thank you larsmans and Gurgeh for your very helpful explanations and advice. I'm using the tree as part of an information retrieval interface over a corpus of texts.

I originally stored the children (max of 30) as a Numpy array, then tried the hardware version (ctypes.py_object*30), the Python array (ArrayType), as well as the dictionary and Set types.

Lists seemed to do better (using guppy to profile the memory, and __slots__['variable',...]), but I'm still trying to squash it down a bit more if I can. The only problem I had with arrays is having to specify their size in advance, which causes a bit of redundancy in terms of nodes with only one child, and I have quite a lot of them. ;-)

After the tree is constructed I intend to convert it to a probabilistic tree with a second pass, but may be I can do this as the tree is constructed. As construction time is not too important in my case, the array.array() sounds like something that would be useful to try, thanks for the tip, really appreciated.

I'll let you know how it goes.

score 9 · Accepted Answer · edited May 23 '17 at 10:34

If you try to pickle an empty list, you get:

>>> s = StringIO()
>>> pickle.dump([], s)
>>> s.getvalue()
'(l.'

and similarly '(d.' for an empty dict. That's three bytes. The in-memory representation of a list, however, contains

a reference count
a type ID, in turn containing a pointer to the type name and bookkeeping info for memory allocation
a pointer to a vector of pointers to actual elements
and yet more bookkeeping info.

On my machine, which has 64-bit pointers, the sizeof a Python list header object is 40 bytes, so that's one order of magnitude. I assume an empty dict will have similar size.

Then, both list and dict use an overallocation strategy to obtain amortized O(1) performance for their main operations, malloc introduces overhead, there's alignment, member attributes that you may or may not even be aware of and various other factors that get you the second order of magnitude.

Summing up: pickle is a pretty good compression algorithm for Python objects :)

I'm really impressed with Pickle, and there's even the possibility of reducing the file size by another 25% using pickletools optimize function. Pickle is sooooo efficient. :-) — Martyn, Jun 14 '11 at 00:20

score 3 · Answer 2 · answered May 18 '11 at 14:27

Do you construct your tree once and then use it without modifying it further? In that case you might want to consider using separate structures for the dynamic construction and the static usage.

Dicts and objects are very good for dynamic modification, but they are not very space efficient in a read-only scenario. I don't know exactly what you are using your suffix tree for, but you could let each node be represented by a 2-tuple of a sorted array.array('c') and an equally long tuple of subnodes (a tuple instead of a vector to avoid overallocation). You traverse the tree using the bisect-module for lookup in the array. The index of a character in the array will correspond to a subnode in the subnode-tuple. This way you avoid dicts, objects and vector.

You could do something similar during the construction process, perhaps using a subnode-vector instead of subnode-tuple. But this will of course make construction slower, since inserting new nodes in a sorted vector is O(N).

This difference between dynamic and static structures is also why the data is so much smaller on disk. It is stored as a compact static structure. Imagine how slow it would be each time you added a node somewhere in the middle of that chunk. — Gurgeh, May 18 '11 at 14:42

Python memory serialisation

2 Answers2