7

I am implementing a Trie in python. Till now I have come across two different methods to implement it:

1) use a class Node (similar to struct Node in C++) with data members:

char - to store character
is_end - to store end of word (true or false)
prefix_count - store number of words with current prefix

child - Node type dict (to store other nodes i.e. for 26 alphabets)

class Node(object):
    def __init__(self):
        self.char = ''
        self.word = ''
        self.is_end = False
        self.prefix_count = 0
        self.child = {}

2) use a dictionary to store all the data:

e.g. for the input words = {'foo', 'bar', 'baz', 'barz'}

we derive this dictionary:

         {'b': {'a': {'r': {'_end_': '_end_', 'z': {'_end_': '_end_'}},
          'z': {'_end_': '_end_'}}},
          'f': {'o': {'o': {'_end_': '_end_'}}}}

Which is the efficient and standard data structure, which is both memory efficient and fast for traversal and other trie operations on big dataset of words?

smci
  • 32,567
  • 20
  • 113
  • 146
divyum
  • 1,286
  • 13
  • 20
  • 1
    https://github.com/kmike/marisa-trie – Padraic Cunningham Apr 19 '15 at 18:59
  • 1
    How do you plan to reference objects of `Node` in `self.child`, considering this is a dictionary? If indeed you keep it as a `dict`, and somehow refer `Node` objects, I see both the methods as having same time complexity, but 1st one having more space complexity. And if you refer `self.child` as a list, then the 1st one might be a bit slower – hyades Apr 19 '15 at 19:02
  • Thanks for the reply. each child in Node will have another object of type Node, which will make it a n-ary tree. @hyades – divyum Apr 19 '15 at 19:04
  • As per me, first one is more structured and has rich information compared to the second one. Also first one will be helpful to add further functionalities like del, count, etc... but still want some opinions for the practical use. @hyades – divyum Apr 19 '15 at 19:09
  • http://stackoverflow.com/questions/11015320/how-to-create-a-trie-in-python – Padraic Cunningham Apr 19 '15 at 19:09
  • @PadraicCunningham already referred it. My question is based on the same topic but a little different. Also I am not going with the 'key' logic used in marisa-trie. – divyum Apr 19 '15 at 19:14
  • 1
    @divyum, Without specific details on what exactly you want to do you can only expect opinions based on peoples personal preference. The answers in that question discuss pros and cons and provide examples. – Padraic Cunningham Apr 19 '15 at 19:17
  • @PadraicCunningham, so requirement is to implement a simple trie but in such a way that it can be used for a big dataset of words (already mentioned in question). In that question, its mentioned "for a large, scalable trie, nested dictionaries might become cumbersome", so I am just asking is the first method better for large, scalable trie. P.S. I am not implementing the key logic used in marisa-trie. Hope the requirements are more clear now. – divyum Apr 19 '15 at 19:25
  • 2
    @divyum yes the first one allows you a more structured form which solves the cumbersome part, but at the same time occupies more space. If you are talking about huge datasets, the thing you worry about is the time complexity. You need a traversal to be O(len(search_str)) regardless of the trie. Both the above methods would yield the exact same thing. – hyades Apr 19 '15 at 20:19
  • If you use `dict` to implement trie, you're [almost] cheating. – Dima Tisnek May 07 '15 at 14:00
  • Note: trie is not an efficient data-structure. Use tree. See: https://www.cs.cmu.edu/~ckingsf/bioinfo-lectures/suffixtrees.pdf – user3639557 Apr 09 '16 at 10:17

3 Answers3

1

Why not both? Just yesterday I was implementing a similar data structure for storing and retrieving a hierarchy of objects and contemplated this exact situation. Ended up using a Node object with a children dictionary. The main node being an object allows you to have custom methods for printing it or getting stuff, and you can even have a lazy initialization if needed (you mentioned big dataset right?)

class Node(object):
    def __init__(self):
        self.children = collections.defaultdict(lambda:self.__class__())
        self.stuff = ...
    def __str__(self,depth=0):
        if self.children:
            return str(self.stuff)+'\n'+'\n'.join([' '*depth+k+' ' + v.__str__(depth+1) for k,v in self.children.items()])
        else:
            return str(self.stuff)
    def get_path(self,path):
        node = self
        while path:
            node = node.children[path.pop(0)]
        ...
Josep Valls
  • 5,483
  • 2
  • 33
  • 67
  • yes, I mentioned big dataset. It can obviously depends on the requirements but my reason was to make it structured along with some other functionalities like counting prefixes, suggesting words, etc... With the first implementation we can add more functionalities but in the second one we shall be bound to specific ones. I wanted to make it more scalable and real time. – divyum Apr 30 '15 at 21:34
1

A direct replacement would be nested a list;

However [arguably] more Pythonic, more compact in memory and thus faster for lookup would be a nested tuple.

Of course, updating such trie becomes logN, as you'll have to recreate every ancestor node. Still, if your lookups are a lot more frequent than updates, it may be worth it.

Dima Tisnek
  • 11,241
  • 4
  • 68
  • 120
-3

Trie is failure when it comes to space complexity. Trie tends to use lots of memory for processing and operating. But to avoid this problem there is a datastructure know as succinct data structure. Try implementing that here.

Refer here for more information.

Ashik Vetrivelu
  • 1,021
  • 1
  • 9
  • 24