Decreasing memory usage with python objects trie served using flask

Question

I am looking for some tips about how to decrease memory usage for python. I am using this piece of code as the main structure to hold my data:

http://stevehanov.ca/blog/index.php?id=114

I need it to serve for proximity word matching using a flask server. I need to put much more than 20 millions of different strings (and it will increase). Now I get MemoryError when trying to put around 14 millions in the Trie.

I just add a dictionary to hold some value with quick access (I need it, but it can be considered as a kind of ID of appearance, it is not directly related to the word)

  class TrieNode:
    values = {}
    def __init__(self):
        self.word = None
        self.children = {}

        global NodeCount
        NodeCount += 1

    def insert( self, word, value):
        node = self
        for letter in word:
            if letter not in node.children: 
                node.children[letter] = TrieNode()

            node = node.children[letter]
        TrieNode.values[word] = value
        node.word = word

I am not familiar with Python optimization, is there any way to make the "letter" object less big to save some memory?

Please note that my difficulty come from the fact that this letter is not only [a-z] but need to handle all the "unicode range" (like accentuated chars but not only). BTW it is a single character, so it should be quite light from the memory fingerprint. How can I use the codepoint instead of the string object (will it be more memory efficient)?

EDIT: adding some other informations following reply from @juanpa-arrivillaga

so, first I see no difference using the slot construct, on my computer, with or without __slot__ I see the same memory usage.

with __slot__ :

>>> class TrieNode:
    NodeCount = 0
    __slots__ = "word", "children"
    def __init__(self):

    self.word = None
    self.children = {}

    #global NodeCount # my goal is to encapsulated the NodeCount in the class itself
    TrieNode.NodeCount += 1


>>> tn = TrieNode()
>>> sys.getsizeof(tn) + sys.getsizeof(tn.__dict__)
176

without __slot__:

>>> class TrieNode:
    NodeCount = 0
    def __init__(self):

        self.word = None
        self.children = {}

        #global NodeCount
        TrieNode.NodeCount += 1


>>> tn = TrieNode()
>>> sys.getsizeof(tn) + sys.getsizeof(tn.__dict__)
176

so I do not understand, why. Where am i wrong ?

here is something else what I tried too, using "intern" keyword, because this value is a string handling an "id" (and so is not related to unicode, not like letter) :

btw my goal was to have with values and NodeCount, the equivalent concept for class/static variables so that each of them is shared by all the instance of the small created objets, I thought it would preserve memory and avoid duplicate, but I may be wrong from my understanding about "static-like" concept in Python)

class TrieNode:
    values = {}    # shared amon all instances so only one structure?
    NodeCount = 0
    __slots__ = "word", "children"
    def __init__(self):

      self.word = None
      self.children = {}

      #global NodeCount
      TrieNode.NodeCount += 1

    def insert( self, word, value = None):
        # value is a string id like "XYZ999999999"
        node = self
        for letter in word:
            codepoint = ord(letter) 
            if codepoint not in node.children: 
                 node.children[codepoint] = TrieNode()

        node = node.children[codepoint]

        node.word = word
        if value is not None:
             lost = TrieNode.values.setdefault(word, [])
             TrieNode.values[word].append(intern(str(value)))

ADDED: Last, I should have precised that i am using Python 2.7.x family.

I was wondering if there were any fixed len data types from library like numpy could help me to save some memory, again as new, i do not know where to look. Btw "word" are not real "natural language word" but "arbitrary length sequence of characters" and they can also be very long.

from your reply, I agree that avoiding to store the word in each node would be efficient, but you need to have a look to the linked article/piece of code. The main goal is not to reconstruct this word but to be able to do efficient/very fast approximate string matching using this word and then getting the "value" related to each of the closest matches, i am not sure i understood what was the goal of the path down to tree. (not reaching the complete tree?), and when matched we just need to get the orginal word matched, (but my understanding can be wrong at this point).

so I need to have this huge dict somewhere and I wanted to encapsulate in the class to be convenient. But so may be it is too much costly from the memory "weight" point of view ?

also I noticed that I get already less memory usage than your sample (I do not know why for now), but so here is an example value of "letter" contained in the structure.

>>> s = u"\u266f"
>>> ord(s)
9839
>>> sys.getsizeof(s)
28
>>> sys.getsizeof(ord(s))
12
>>> print s
♯
>>> repr(s)
"u'\\u266f'"

juanpa.arrivillaga · Answer 1 · 2018-02-05T09:11:00.210

Low hanging fruit: use __slots__ in your node class, otherwise, each TrieNode object is carrying around a dict.

class TrieNode:
    __slots__ = "word", "children"
    def __init__(self):
        self.word = None
        self.children = {}

Now, each TrieNode object will not carry around an attribute dict. Compare the sizes:

>>> class TrieNode:
...     def __init__(self):
...         self.word = None
...         self.children = {}
...
>>> tn = TrieNode()
>>> sys.getsizeof(tn) + sys.getsizeof(tn.__dict__)
168

Vs:

>>> class TrieNode:
...     __slots__ = "word", "children"
...     def __init__(self):
...         self.is_word = False
...         self.children = {}
...
>>> sys.getsizeof(tn)
56
>>> tn.__dict__
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'TrieNode' object has no attribute '__dict__'

Another optimization, use int objects. Small int objects are cached, it is probable most of your characters will be in that range anyway, but even if they aren't, an int, while still beefy in Python, is smaller than even a single character string:

>>> 'ñ'
'ñ'
>>> ord('ñ')
241
>>> sys.getsizeof('ñ')
74
>>> sys.getsizeof(ord('ñ'))
28

So you can do something like:

def insert( self, word, value):
    node = self
    for letter in word:
        code_point = ord(letter)
        if code_point not in node.children: 
            node.children[code_point] = TrieNode()

        node = node.children[code_point]
    node.is_word = True #Don't save the word, simply a reference to a singleton

Also, you are keeping around a class variable values dict that is growing enormously, but that information is redundant. You say:

I just add a dictionary to hold some value with quick access (I need it)

You can reconstruct the words from the path. It should be relatively fast, I would seriously consider against having this dict. Check out how much memory it requires simply to hold a million one-character strings:

>>> d = {str(i):i for i in range(1000000)}
>>> (sum(sizeof(k)+sizeof(v) for k,v in d.items()) + sizeof(d)) * 1e-9
0.12483203000000001

You could do something like:

class TrieNode:
    __slots__ = "value", "children"
    def __init__(self):
        self.value = None
        self.children = {}

    def insert( self, word, value):
        node = self
        for letter in word:
            code_point = ord(letter)
            if code_point not in node.children: 
                node.children[code_point] = TrieNode()

            node = node.children[code_point]
        node.value = value #this serves as a signal that it is a word


    def get(word, default=None):
        val = self._get_value(word)
        if val is None:
            return default
        else:
            return val

    def _get_value(self, word):
        node = self
        for letter in word:
            code_point = ord(letter)
            try:
                node = node.children[code_point]
            except KeyError:
                return None
        return node.value

i give some more informations, about the value thing. And i can not get some improvements, my memoryError still occur with the __slot__ :/ and i did not even get more object in memory, i count them, i still reach the limit at the same point. BTW great thank you for helping and giving clues :) — Greg P., Feb 05 '18 at 12:48

Decreasing memory usage with python objects trie served using flask

1 Answers1