I am looking for some tips about how to decrease memory usage for python. I am using this piece of code as the main structure to hold my data:
http://stevehanov.ca/blog/index.php?id=114
I need it to serve for proximity word matching using a flask server. I need to put much more than 20 millions of different strings (and it will increase). Now I get MemoryError when trying to put around 14 millions in the Trie.
I just add a dictionary to hold some value with quick access (I need it, but it can be considered as a kind of ID of appearance, it is not directly related to the word)
class TrieNode:
values = {}
def __init__(self):
self.word = None
self.children = {}
global NodeCount
NodeCount += 1
def insert( self, word, value):
node = self
for letter in word:
if letter not in node.children:
node.children[letter] = TrieNode()
node = node.children[letter]
TrieNode.values[word] = value
node.word = word
I am not familiar with Python optimization, is there any way to make the "letter" object less big to save some memory?
Please note that my difficulty come from the fact that this letter is not only [a-z] but need to handle all the "unicode range" (like accentuated chars but not only). BTW it is a single character, so it should be quite light from the memory fingerprint. How can I use the codepoint instead of the string object (will it be more memory efficient)?
EDIT: adding some other informations following reply from @juanpa-arrivillaga
so, first I see no difference using the slot construct, on my computer, with or without __slot__
I see the same memory usage.
with __slot__
:
>>> class TrieNode:
NodeCount = 0
__slots__ = "word", "children"
def __init__(self):
self.word = None
self.children = {}
#global NodeCount # my goal is to encapsulated the NodeCount in the class itself
TrieNode.NodeCount += 1
>>> tn = TrieNode()
>>> sys.getsizeof(tn) + sys.getsizeof(tn.__dict__)
176
without __slot__
:
>>> class TrieNode:
NodeCount = 0
def __init__(self):
self.word = None
self.children = {}
#global NodeCount
TrieNode.NodeCount += 1
>>> tn = TrieNode()
>>> sys.getsizeof(tn) + sys.getsizeof(tn.__dict__)
176
so I do not understand, why. Where am i wrong ?
here is something else what I tried too, using "intern" keyword, because this value is a string handling an "id" (and so is not related to unicode, not like letter) :
btw my goal was to have with values and NodeCount, the equivalent concept for class/static variables so that each of them is shared by all the instance of the small created objets, I thought it would preserve memory and avoid duplicate, but I may be wrong from my understanding about "static-like" concept in Python)
class TrieNode:
values = {} # shared amon all instances so only one structure?
NodeCount = 0
__slots__ = "word", "children"
def __init__(self):
self.word = None
self.children = {}
#global NodeCount
TrieNode.NodeCount += 1
def insert( self, word, value = None):
# value is a string id like "XYZ999999999"
node = self
for letter in word:
codepoint = ord(letter)
if codepoint not in node.children:
node.children[codepoint] = TrieNode()
node = node.children[codepoint]
node.word = word
if value is not None:
lost = TrieNode.values.setdefault(word, [])
TrieNode.values[word].append(intern(str(value)))
ADDED: Last, I should have precised that i am using Python 2.7.x family.
I was wondering if there were any fixed len data types from library like numpy could help me to save some memory, again as new, i do not know where to look. Btw "word" are not real "natural language word" but "arbitrary length sequence of characters" and they can also be very long.
from your reply, I agree that avoiding to store the word in each node would be efficient, but you need to have a look to the linked article/piece of code. The main goal is not to reconstruct this word but to be able to do efficient/very fast approximate string matching using this word and then getting the "value" related to each of the closest matches, i am not sure i understood what was the goal of the path down to tree. (not reaching the complete tree?), and when matched we just need to get the orginal word matched, (but my understanding can be wrong at this point).
so I need to have this huge dict somewhere and I wanted to encapsulate in the class to be convenient. But so may be it is too much costly from the memory "weight" point of view ?
also I noticed that I get already less memory usage than your sample (I do not know why for now), but so here is an example value of "letter" contained in the structure.
>>> s = u"\u266f"
>>> ord(s)
9839
>>> sys.getsizeof(s)
28
>>> sys.getsizeof(ord(s))
12
>>> print s
♯
>>> repr(s)
"u'\\u266f'"