Implementing a Patricia Trie for use as a dictionary

Question

I'm attempting to implement a Patricia Trie with the methods addWord(), isWord(), and isPrefix() as a means to store a large dictionary of words for quick retrieval (including prefix search). I've read up on the concepts but they just aren't clarifying into an implementation. I want to know (in Java or Python code) how to implement the Trie, particularly the nodes (or should I implement it recursively). I saw one person who implemented it with an array of 26 child nodes set to null/None. Is there a better strategy (such as treating the letters as bits) and how would you implement it?

It is based off a homework assignment but the scope of this question is only a fraction of that assignment and is (particularly now) more relevant to my personal understanding of how tries work than getting a grade. — Regis Frey, Mar 09 '10 at 07:23
I just want to say that I wanted to explore Patricia tries once, too, and wound up giving up. I found the Knuth exposition too terse, and the original paper made no sense to me at all. — Jonathan Feinberg, Mar 09 '10 at 14:00
Based on trying this out in JavaScript, I think a realistic Python implementation would actually be a C extension. The same API on a sorted array was much faster than a trie in JavaScript, and I'd expect the same to hold true for Python. — Steven Huwig, Mar 09 '10 at 21:23

Justin Peel · Accepted Answer · 2011-04-01T02:20:23.387

Someone else asked a question about Patricia tries a while ago and I thought about making a Python implementation then, but this time I decided to actually give it a shot (Yes, this is way overboard, but it seemed like a nice little project). What I have made is perhaps not a pure Patricia trie implementation, but I like my way better. Other Patricia tries (in other languages) use just a list for the children and check each child to see there is a match, but I thought this was rather inefficient so I use dictionaries. Here is basically how I've set it up:

I'll start at the root node. The root is just a dictionary. The dictionary has keys that are all single characters (the first letters of words) leading to branches. The values corresponding with each key are lists where the first item is a string which gives the rest of the string that matches with this branch of the trie, and the second item is a dictionary leading to further branches from this node. This dictionary also has single character keys that correspond with the first letter of the rest of the word and the process continues down the trie.

Another thing I should mention is that if a given node has branches, but also is a word in the trie itself, then that is denoted by having a '' key in the dictionary that leads to a node with the list ['',{}].

Here's a small example that shows how words are stored (the root node is the variable _d):

>>> x = patricia()
>>> x.addWord('abcabc')
>>> x._d
{'a': ['bcabc', {}]}
>>> x.addWord('abcdef')
>>> x._d
{'a': ['bc', {'a': ['bc', {}], 'd': ['ef', {}]}]}
>>> x.addWord('abc')
{'a': ['bc', {'a': ['bc', {}], '': ['', {}], 'd': ['ef', {}]}]}

Notice that in the last case, a '' key was added to the dictionary to denote that 'abc' is a word in a addition to 'abcdef' and 'abcabc'.

Source Code

class patricia():
    def __init__(self):
        self._data = {}

    def addWord(self, word):
        data = self._data
        i = 0
        while 1:
            try:
                node = data[word[i:i+1]]
            except KeyError:
                if data:
                    data[word[i:i+1]] = [word[i+1:],{}]
                else:
                    if word[i:i+1] == '':
                        return
                    else:
                        if i != 0:
                            data[''] = ['',{}]
                        data[word[i:i+1]] = [word[i+1:],{}]
                return

            i += 1
            if word.startswith(node[0],i):
                if len(word[i:]) == len(node[0]):
                    if node[1]:
                        try:
                            node[1]['']
                        except KeyError:
                            data = node[1]
                            data[''] = ['',{}]
                    return
                else:
                    i += len(node[0])
                    data = node[1]
            else:
                ii = i
                j = 0
                while ii != len(word) and j != len(node[0]) and \
                      word[ii:ii+1] == node[0][j:j+1]:
                    ii += 1
                    j += 1
                tmpdata = {}
                tmpdata[node[0][j:j+1]] = [node[0][j+1:],node[1]]
                tmpdata[word[ii:ii+1]] = [word[ii+1:],{}]
                data[word[i-1:i]] = [node[0][:j],tmpdata]
                return

    def isWord(self,word):
        data = self._data
        i = 0
        while 1:
            try:
                node = data[word[i:i+1]]
            except KeyError:
                return False
            i += 1
            if word.startswith(node[0],i):
                if len(word[i:]) == len(node[0]):
                    if node[1]:
                        try:
                            node[1]['']
                        except KeyError:
                            return False
                    return True
                else:
                    i += len(node[0])
                    data = node[1]
            else:
                return False

    def isPrefix(self,word):
        data = self._data
        i = 0
        wordlen = len(word)
        while 1:
            try:
                node = data[word[i:i+1]]
            except KeyError:
                return False
            i += 1
            if word.startswith(node[0][:wordlen-i],i):
                if wordlen - i > len(node[0]):
                    i += len(node[0])
                    data = node[1]
                else:
                    return True
            else:
                return False

    def removeWord(self,word):
        data = self._data
        i = 0
        while 1:
            try:
                node = data[word[i:i+1]]
            except KeyError:
                print "Word is not in trie."
                return
            i += 1
            if word.startswith(node[0],i):
                if len(word[i:]) == len(node[0]):
                    if node[1]:
                        try:
                            node[1]['']
                            node[1].pop('')
                        except KeyError:
                            print "Word is not in trie."
                        return
                    data.pop(word[i-1:i])
                    return
                else:
                    i += len(node[0])
                    data = node[1]
            else:
                print "Word is not in trie."
                return


    __getitem__ = isWord

You may have noticed that at the end I set __getitem__ to the isWord method. This means that

x['abc']

will return whether 'abc' in the trie or not.

I think that maybe I should make a module out of this and submit it to PyPI, but it needs more testing and at least a removeWord method. If you find any bugs let me know, but it seems to be working pretty well. Also, if you see any big improvements in efficiency I would also like to hear about them. I've considered doing something about having empty dictionaries at the bottom of each branch, but I'm leaving it for now. These empty dictionaries may be replaced with data linked to the word to expand the uses of the implementation for instance.

Anyway, if you don't like the way I implemented it, at least maybe this will give you some ideas about how you would like to implement your own version.

@John Yes, but I labeled those that way to make the code easier for other to figure out. I might make those sorts of changes for the final version. Thanks for the input. — Justin Peel, Mar 09 '10 at 22:00
@John Yes, these are all good for making the size of the .py file smaller (which wasn't really my goal at this point), but how does it really help efficiency? — Justin Peel, Mar 10 '10 at 05:46
@Justin: Please have your irony detector checked out; it appears to be malfunctioning. — John Machin, Mar 10 '10 at 06:25
@John Irony doesn't transmit well through text. You could have been someone truly trying to be helpful like most people on here and I try to respond nicely even if it isn't a great comment/idea. Please don't waste my time with something as stupid as this. — Justin Peel, Mar 10 '10 at 07:23
@Justin, please see http://stackoverflow.com/questions/3121916/python-implementation-of-patricia-tries/3122502#3122502 -- there's a bug in this code of yours, which the OP encountered; I believe I found the bug and posted it in the answer there -- you may want to check my fix and either supply the correct one or confirm that mine is right (and in either case fix the code in this answer, too!-) -- thanks. — Alex Martelli, Jun 26 '10 at 02:41
@Alex, well spotted. I've put in the bug fix. I'm not in a place where I can actually test this very easily at the moment since my motherboard fried this week (using my wife's at the moment and trying not to clutter her computer up while I find a replacement), but it is clearly the problem. I don't even know if this is really a decent implementation of a patricia trie, but it is simple and at least based on a patricia trie. — Justin Peel, Jun 26 '10 at 04:41
@John Using short variable names doesn't optimize the code at all as far as processing is concerned, and using poor variable names like 'x' and 'y' is just an artifact lazy programming. It doesn't help in any way, and makes code much less readable. — monokrome, Mar 30 '11 at 07:41

score 3 · Answer 2 · answered Apr 16 '13 at 21:33

Here's a recursive implementation using more pythonic methods:

def matching_prefix_index(word1, word2):
    max_len = min(len(word1),len(word2))
    for i in range(max_len):
        if word2[i] != word1[i]:
            return i
    return max_len

class PatriciaTrie(object):
    def __init__(self):
        self._storage = {}
        self._complete_prefix_flag = False

    def _find_storage_key(self, word):
        for key in self._storage:
            prefix_index = matching_prefix_index(key, word)
            if prefix_index > 0:
                return (key, prefix_index)
        return (None, None)

    def add(self, word):
        if word == '':
            self._complete_prefix_flag = True
            return True

        key, prefix_index = self._find_storage_key(word)
        if key is not None:
            if prefix_index == len(key):
                return self._storage[key].add(word[len(key):])
            else:
                new_tree = PatriciaTrie()
                new_tree._storage[key[prefix_index:]] = self._storage.pop(key)
                self._storage[key[0:prefix_index]] = new_tree
                return new_tree.add(word[prefix_index:])
        else:
            self._storage[word] = PatriciaTrie()
            self._storage[word].add('')
            return True

    def remove(self, word):
        if word == '':
            self._complete_prefix_flag = False
            return True

        key, prefix_index = self._find_storage_key(word)
        if key is None or prefix_index != len(key):
            return False

        subword = word[prefix_index:]
        subtrie = self._storage[key]
        if subtrie.remove(subword):
            if (not subtrie._complete_prefix_flag) and len(subtrie._storage) == 0:
                self._storage.pop(key)
            return True
        else:
            return False

    def __contains__(self, word):
        if word == '':
            return self._complete_prefix_flag

        key, prefix_index = self._find_storage_key(word)
        if key is None or prefix_index != len(key):
            return False
        else:
            return (word[prefix_index:] in self._storage[key])

    def has_prefix(self, word):
        if word == '':
            return True

        key, prefix_index = self._find_storage_key(word)
        if key is None:
            return False
        elif len(key) > len(word):
            return (prefix_index == len(word))
        elif len(key) != prefix_index:
            return False
        else:
            return self._storage[key].has_prefix(word[prefix_index:])

Implementing a Patricia Trie for use as a dictionary

2 Answers2

Linked