I got a nice answer to my earlier question about de/serialization, which led me to write a method that either deserializes a defaultdict(list) from a file if that file exists, or builds the dictionary anew if it does not.
After implementing some simple code:
try:
    # deserialize - this takes about 6 seconds
    with open('dict.flat') as stream:
        for line in stream:
            vals = line.split()
            lexicon[vals[0]] = vals[1:]
except IOError:
    # create new - this takes about 40 seconds
    for word in lexicon_file:
        word = word.lower()
        for ngram in letter_ngrams(word):   # helper yielding the word's letter bi-/trigrams (not shown)
            lexicon[ngram].append(word)
    # serialize - about 6 seconds
    with open('dict.flat', 'w') as stream:
        stream.write('\n'.join([' '.join([k] + v) for k, v in lexicon.iteritems()]))
I was a little shocked at the amount of RAM my script takes when deserializing from the file.
(The lexicon_file contains about 620 000 words, and the processed defaultdict(list) contains 25 000 keys; each key maps to a list of between 1 and 133 000 strings (average 500, median 20). Each key is a letter bi-/trigram and its values are the words that contain that letter n-gram.)
When the script creates the lexicon anew, the whole process doesn't use much more than 160 MB of RAM (the serialized file itself is a little over 129 MB). When the script deserializes the lexicon, the amount of RAM taken by python.exe jumps to 500 MB.
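One way to see where the extra memory goes is to count the stored strings once per reference versus once per distinct object. This is only a rough sketch (it ignores list and dict overhead and assumes the lexicon is already built in memory), but on the freshly created lexicon the two numbers should diverge sharply, while on the naively deserialized one they should be almost equal:

import sys

def string_footprint(lexicon):
    """Approximate string memory: once per reference vs. once per distinct object."""
    per_reference = 0      # bytes if every stored string were its own object
    per_object = 0         # bytes counted once per distinct object (by id)
    seen_ids = set()
    for key, words in lexicon.iteritems():   # use .items() on Python 3
        for w in words:
            per_reference += sys.getsizeof(w)
            if id(w) not in seen_ids:
                seen_ids.add(id(w))
                per_object += sys.getsizeof(w)
    return per_reference, per_object

per_reference, per_object = string_footprint(lexicon)
print('no sharing: %.0f MB, actually allocated: %.0f MB'
      % (per_reference / 1e6, per_object / 1e6))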
I also tried to emulate the way a new lexicon is created during deserialization, appending the items one by one:
#deserialize one by one - about 15 seconds
with open('dict.flat') as stream:
    for line in stream:
        vals = line.split()
        for item in vals[1:]:
            lexicon[vals[0]].append(item)
The memory results are exactly the same, except that this snippet runs significantly slower.
What is causing such a drastic difference in memory consumption? My first thought was that, since many elements in the resulting lists are exactly the same string, Python somehow builds the dictionary more memory-efficiently when creating it anew, storing references to a single string object instead of many copies, and that this sharing does not happen when whole lists from split() are mapped directly to keys. But if that is the case, why doesn't appending the items one by one, exactly as when creating a new lexicon, solve the problem?
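A tiny experiment (CPython-specific, since is compares object identity) illustrates why appending one by one does not help: str.split() returns a brand-new string object for every token, even when the text is identical, so there is nothing shared to begin with. Appending an existing string to many lists, as the lexicon-building loop does with word, stores only references to one object:

# split() yields distinct objects for equal tokens
a, b = 'foo foo'.split()
print(a == b)    # True  - same text
print(a is b)    # False - two separate string objects (CPython detail)

# appending one object to many lists shares it
word = 'foo'
lists = [[], []]
for lst in lists:
    lst.append(word)
print(lists[0][0] is lists[1][0])   # True - both lists point at the same object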
edit: This topic was already discussed in this question (how have I missed it?!). Python can be forced to build the dictionary from shared string references by using the intern() function:
#deserialize with intern - 45 seconds
with open('dict.flat') as stream:
    for line in stream:
        vals = line.split()
        for item in vals[1:]:
            lexicon[intern(vals[0])].append(intern(item))
This reduces the amount of RAM taken by the dictionary to the expected value (160 MB), but the trade-off is that the running time is back to the same level as creating the dict anew, which completely defeats the purpose of serialization.
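For completeness, a sketch of an alternative I have not benchmarked: instead of intern() (which became sys.intern in Python 3), an ordinary dict can serve as a memo table so that each distinct string is stored only once. This is not the original code, just an illustration of the same reference-sharing idea:

from collections import defaultdict

lexicon = defaultdict(list)
memo = {}   # string value -> the single shared string object

with open('dict.flat') as stream:
    for line in stream:
        vals = line.split()
        key = memo.setdefault(vals[0], vals[0])
        for item in vals[1:]:
            lexicon[key].append(memo.setdefault(item, item))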