I got a nice answer to my earlier question about de/serialization, which led me to write a method that either deserializes a defaultdict(list) from a file if that file exists, or builds the dictionary anew if it does not.
After implementing some simple code:
try:
    # deserialize - this takes about 6 seconds
    with open('dict.flat') as stream:
        for line in stream:
            vals = line.split()
            lexicon[vals[0]] = vals[1:]
except IOError:
    # create new - this takes about 40 seconds
    for word in lexicon_file:
        word = word.lower()
        for ngram in letter_ngrams(word):   # helper yielding the word's letter bi-/trigrams (not shown)
            lexicon[ngram].append(word)
    # serialize - about 6 seconds
    with open('dict.flat', 'w') as stream:
        stream.write('\n'.join([' '.join([k] + v) for k, v in lexicon.iteritems()]))
I was a little shocked at the amount of RAM my script takes when deserializing from the file.
(The lexicon_file contains about 620 000 words, and the processed defaultdict(list) contains 25 000 keys; each key maps to a list of between 1 and 133 000 strings (average 500, median 20). Each key is a letter bi-/trigram and its values are the words that contain that letter n-gram.)
When the script creates the lexicon anew, the whole process doesn't use much more than 160 MB of RAM (the serialized file itself is a little over 129 MB). When the script deserializes the lexicon, the amount of RAM taken by python.exe jumps to 500 MB.
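One way to see where the extra memory goes is to count the stored strings once per reference versus once per distinct object. This is only a rough sketch (it ignores list and dict overhead and assumes the lexicon is already built in memory), but on the freshly created lexicon the two numbers should diverge sharply, while on the naively deserialized one they should be almost equal:

import sys

def string_footprint(lexicon):
    """Approximate string memory: once per reference vs. once per distinct object."""
    per_reference = 0      # bytes if every stored string were its own object
    per_object = 0         # bytes counted once per distinct object (by id)
    seen_ids = set()
    for key, words in lexicon.iteritems():   # use .items() on Python 3
        for w in words:
            per_reference += sys.getsizeof(w)
            if id(w) not in seen_ids:
                seen_ids.add(id(w))
                per_object += sys.getsizeof(w)
    return per_reference, per_object

per_reference, per_object = string_footprint(lexicon)
print('no sharing: %.0f MB, actually allocated: %.0f MB'
      % (per_reference / 1e6, per_object / 1e6))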
I also tried to emulate the way a new lexicon is created during deserialization, appending the items one by one:
#deserialize one by one - about 15 seconds
with open('dict.flat') as stream:
    for line in stream:
        vals = line.split()
        for item in vals[1:]:
            lexicon[vals[0]].append(item)
The memory results are exactly the same, except that this snippet runs significantly slower.
What is causing such a drastic difference in memory consumption? My first thought was that, since many elements in the resulting lists are exactly the same string, Python somehow builds the dictionary more memory-efficiently when creating it anew, storing references to a single string object instead of many copies, and that this sharing does not happen when whole lists from split() are mapped directly to keys. But if that is the case, why doesn't appending the items one by one, exactly as when creating a new lexicon, solve the problem?
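A tiny experiment (CPython-specific, since is compares object identity) illustrates why appending one by one does not help: str.split() returns a brand-new string object for every token, even when the text is identical, so there is nothing shared to begin with. Appending an existing string to many lists, as the lexicon-building loop does with word, stores only references to one object:

# split() yields distinct objects for equal tokens
a, b = 'foo foo'.split()
print(a == b)    # True  - same text
print(a is b)    # False - two separate string objects (CPython detail)

# appending one object to many lists shares it
word = 'foo'
lists = [[], []]
for lst in lists:
    lst.append(word)
print(lists[0][0] is lists[1][0])   # True - both lists point at the same object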
edit: This topic was already discussed in this question (how have I missed it?!). Python can be forced to build the dictionary from shared string references by using the intern() function:
#deserialize with intern - 45 seconds
with open('dict.flat') as stream:
    for line in stream:
        vals = line.split()
        for item in vals[1:]:
            lexicon[intern(vals[0])].append(intern(item))
This reduces the amount of RAM taken by the dictionary to the expected value (160 MB), but the trade-off is that the running time is back to the same level as creating the dict anew, which completely defeats the purpose of serialization.
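For completeness, a sketch of an alternative I have not benchmarked: instead of intern() (which became sys.intern in Python 3), an ordinary dict can serve as a memo table so that each distinct string is stored only once. This is not the original code, just an illustration of the same reference-sharing idea:

from collections import defaultdict

lexicon = defaultdict(list)
memo = {}   # string value -> the single shared string object

with open('dict.flat') as stream:
    for line in stream:
        vals = line.split()
        key = memo.setdefault(vals[0], vals[0])
        for item in vals[1:]:
            lexicon[key].append(memo.setdefault(item, item))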