2

I want to use nltk.containers.Trie to perform simple operations like inserting a word into the trie, retrieving all words with a given prefix, find nodes with most descendants (i.e. most common prefixes), graphically viewing the trie and so on. I couldn't find any documentation whatsoever regarding the use of this structure. Here's all I have so far:

from nltk.containers import Trie
t = Trie()

I now have a list of words which I need to add to the trie.

Velvet Ghost
  • 408
  • 1
  • 9
  • 27

1 Answers1

4

It's pretty cryptic, isn't it. It's basically a dictionary but you can additionally check if a string is a prefix of a known key:

>>> t = Trie()
>>> t['they'] = 15
>>> 'the' in t
True
>>> print t['the']
None

There's also find_prefix, which will match as much of its argument as possible, and return the value it finds there (if any) plus the remainder of the argument:

>>> t.find_prefix("theirs")
(None, 'irs')                 # Prefix "the" has no value

You could take a look at the source in nltk/containers.py. The magic is in the methods __setitem__ and __getitem__, which handle expressions of the form t[key].

Also good to know: The keys() method will only return real entries, not prefixes. You can use it with the method subtrie to retrieve all words that begin with a given prefix:

>>> t.subtrie('th').keys()
['ey']

PS. Note that containers.py was removed from the NLTK about six months ago! Before you update your nltk distribution (which you should), save nltk/containers.py under a different name. Better yet, just save the Trie class. (The rest of the file is obsolete).

alexis
  • 48,685
  • 16
  • 101
  • 161
  • Thanks, that was a lifesaver! However, is there any way find all words starting with a particular prefix, or the number of words with a prefix, or the 'height' of a node, and other such basic tree operations / manipulations? – Velvet Ghost Jun 06 '12 at 05:55
  • Take a look at the source in `nltk/containers.py`, it's pretty straightforward. A Trie is a dictionary plus what you'll find defined there. – alexis Jun 07 '12 at 16:45
  • Added info on getting words that start with a prefix; but I've no idea what you mean by the "height" of a node. – alexis Jun 07 '12 at 17:42
  • My language is not English. When I try to do (for example): t.subtrie('ওে').keys() (which should return a lot of things), I get the error: File "/usr/lib/python2.7/site-packages/nltk/containers.py", line 163, in subtrie curr_node = curr_node[1][char] KeyError: '\xe0'. What could be the problem? – Velvet Ghost Jun 08 '12 at 09:15
  • No idea, but it's probably caused by the fact that python does not know what language you're using. Use unicode strings and hope that `Trie` supports them. If it does not, I recommend fixing it to support unicode-- it's a very simple module, and you need to understand how python 2 handles unicode anyway. – alexis Jun 08 '12 at 21:53
  • If you can't sort it out, compose a new question and give a helpful description of the problem. – alexis Jun 08 '12 at 21:54