1

Consider the following code:

import random                                                                   

class Trie:                                                                     
    def __init__(self, children, end):                                          
        self.children = children                                                
        self.end = end                                                          

def trie_empty():                                                               
    return Trie(dict(), False)                                                  

def trie_insert(x, t):                                                          
    if not x:                                                                   
        t.end = True                                                            
        return                                                                  
    try:                                                                        
        t2 = t.children[x[0]]                                                   
    except KeyError:                                                            
        t2 = trie_empty()                                                       
        t.children[x[0]] = t2                                                     
    trie_insert(x[1:], t2)                                                      

def fill_dict(root):                                                            
    memo = dict()                                                               
    def fill(pfx='', depth=0):                                                  
        try:                                                                    
            memo[pfx]                                                           
        except KeyError:                                                        
            pass                                                                
        else:                                                                   
            return                                                              
        if depth > 6:                                                           
            return                                                              
        for ci in range(ord('a'), ord('d') + 1):                                
            fill(pfx + chr(ci), depth + 1)                                      
        bw = None                                                               
        memo[pfx] = None, bw                                                    
    fill()                                                                      
    # del memo                                                                  

def random_word():                                                              
    l = int(random.random() * 10)                                               
    w = ''.join([chr(int(random.random() * 26) + ord('a')) for _ in range(l)])  
    return w                                                                    

def main():                                                                     
    t = trie_empty()                                                            
    for _ in range(10000):                                                      
        trie_insert(random_word(), t)                                           

    while True:                                                                 
        fill_dict(t)                                                            

if __name__ == '__main__':                                                      
    main()

When I run this, it continues to use more memory until I kill it. If I uncomment the del memo, it runs while using a constant amount of memory. From this, I conclude that the local variable memo is not being cleaned up when fill_dict returns.

This behavior is really mysterious to me, especially because basically all of the above code is necessary to see this behavior. even the completely unused argument to fill_dict cannot be omitted for the program to use unbounded memory.

This is really frustrating. Surely a modern, garbage-collected language can clean up its own variables and I shouldn't have to manually delete function-local variables. Even C can clean up the stack when a function returns. Why can't Python (in this situation)?

Program man
  • 393
  • 3
  • 13
  • 2
    Why are `trie_empty` and `trie_insert` regular functions instead of methods? – chepner Feb 16 '19 at 18:43
  • 4
    I suspect this is because you are using a closure (`fill`) and closures are special in that they remember the context in which they are called. Given it uses `memo` it presumably keeps a handle on it so that it doesn't go out of scope. – match Feb 16 '19 at 18:44
  • @chepner Because I'm not really used to OOP, and it doesn't impact the weird behavior I'm seeing – Program man Feb 16 '19 at 18:44
  • 3
    `fill` (inside `fill_dict`) is bound to a closure created by calling `fill_dict`, and this closure keeps `memo` alive. What's not clear is what keeps the closure itself alive when `fill_dict` returns. If `root` had a reference to `fill` or one of the things alive inside `fill`, that would be why, but it doesn't. – torek Feb 16 '19 at 18:46
  • 1
    Oh I think I know, memo has a reference to bw, which is in fill, and fill has a reference to memo, so it's a circular reference and it never gets cleaned up – Program man Feb 16 '19 at 18:53
  • Aha, that's it. Break that reference and the reference-counting GC will clean up. Or you can invoke a full GC, or use a Python implementation that doesn't start with a reference-counting GC. – torek Feb 16 '19 at 19:10

1 Answers1

7

I think this question deserves an answer, now that between me and Program man—and match mentioned the same starting point in a comment—we've figured it out.

The module-level function fill_dict has an inner function fill:

def fill_dict(root):                                                            
    memo = dict()                                                               
    def fill(pfx='', depth=0):                                                  

This inner name fill is bound to the entity created by compiling its contents. That entity refers back to the name memo that is bound to a new, empty dictionary at entry to fill_dict, so that entity is itself a closure.

Now, closures can be garbage-collected, and Python does have a garbage collector. But CPython in particular has a two-layer collector: there's a sort of main, always-on, reference-count based collector, and then a true mark-and-sweep style GC that runs much less often. (See When does CPython garbage collect? and Why does python use both reference counting and mark-and-sweep for gc?)

Sidebar: what's wrong with reference counting collectors?

A reference-counting collector is defeated by cycles:

>>> x = []
>>> x.append(x)
>>> x
[[...]]

Here x is bound to a list whose first element is the list to which x is bound. That is, x[0] is x, and x[0][0] is x, and so on:

>>> x[0] is x
True
>>> x[0][0] is x
True

With this kind of loop, deleting x doesn't help because the list refers to itself. However, we can make a fancier loop:

>>> a = dict()
>>> b = dict()
>>> a['link-to-b'] = b
>>> b['link-to-a'] = a
>>> a
{'link-to-b': {'link-to-a': {...}}}
>>> b
{'link-to-a': {'link-to-b': {...}}}

Now if we kill off one of the links, the circularity vanishes:

>>> a['link-to-b'] = None
>>> a
{'link-to-b': None}
>>> b
{'link-to-a': {'link-to-b': None}}

and things will be fine again.

Back to the problem at hand

In this particular case, fill has a reference to the memo instance in its outer fill_dict, and one of the entries in memo is:

        memo[pfx] = None, bw                                                    

The variable bw itself is defined inside the closure, so memo[pfx] refers to the closure (or more accurately, to an entity within the closure), while the closure refers to memo, and that's our circular reference.

Hence, even when fill_dict returns, the reference count on the closure has not dropped to zero.

torek
  • 448,244
  • 59
  • 642
  • 775
  • 1
    Excellent write-up - I got as far as going 'closure! circular reference!' and ran for the hills :-D – match Feb 16 '19 at 19:41