5

I have a list of words. For example:

reel
road
root
curd

I would like to store this data in a manner that reflects the following structure:

Start -> r -> e -> reel
           -> o -> a -> road
                   o -> root
         c -> curd

It is apparent to me that I need to implement a tree. From this tree, I must be able to easily obtain statistics such as the height of a node, the number of descendants of a node, searching for a node and so on. Adding a node should 'automatically' add it to the correct position in the tree, since this position is unique.

It would also like to be able to visualize the data in the form of an actual graphical tree. Since the tree is going to be huge, I would need zoom / pan controls on the visualization. And of course, a pretty visualization is always better than an ugly one.

Does anyone know of a Python package which would allow me to achieve all this simply? Writing the code myself will take quite a while. Do you think http://packages.python.org/ete2/ would be appropriate for this task?

I'm on Python 2.x, btw.


I discovered that NLTK has a trie class - nltk.containers.trie. This is convenient for me, since I already use NLTK. Does anyone know how to use this class? I can't find any examples anywhere! For example, how do I add words to the trie?

Kev
  • 118,037
  • 53
  • 300
  • 385
Velvet Ghost
  • 408
  • 1
  • 9
  • 27

2 Answers2

4

ETE2 is an environment for tree exploration, in principle made for browsing, building and exploring phylogenetic trees, and i've used it long time ago for these purposes. But its possible that if you set your data properly, you could get it done.

You just have to place paretheses wherever you need to split your tree and create a branch. See the following example, taken from ETE doc. If you change these "(A,B,(C,D));" for your words/letters it should be done.

from ete2 import Tree
unrooted_tree = Tree( "(A,B,(C,D));" )
print unrooted_tree

output:

     /-A
    |
----|--B
    |
    |     /-C
     \---|
          \-D

...and this package will let u do most of the operations you want, giving u the chance to select every branch individually, and operating with it in an easy way. I recommend u to give a look to the tutorial anyway, not pretty difficult :)

georg
  • 211,518
  • 52
  • 313
  • 390
peixe
  • 1,272
  • 3
  • 14
  • 31
  • The problem is, I can't create my whole trie in the constructor like you've illustrated. It needs to be created one word at a time, and each word inserted in the correct position. Can ETE2 do this automatically for me? – Velvet Ghost Jun 04 '12 at 09:49
  • 1
    @VelvetGhost, I don't know quite much about tries, but... Could a possibility be to create manually each one of the trees separately, and then concatenate them in the preferred order, telling ETE where to connect the trees? And this, i'm sure ETE2 can do it... – peixe Jun 04 '12 at 10:21
  • It may be possible, but I'm sure there's a simpler way. I found a trie implementation (see below), but can't figure out how to use it! – Velvet Ghost Jun 04 '12 at 10:25
3

I think the following example does pretty much what you want, using the ETE toolkit.

from ete2 import Tree

words = [ "reel", "road", "root", "curd", "curl", "whatever","whenever", "wherever"]

#Creates a empty tree
tree = Tree()
tree.name = ""
# Lets keep tree structure indexed
name2node = {}
# Make sure there are no duplicates
words = set(words)
# Populate tree
for wd in words:
    # If no similar words exist, add it to the base of tree
    target = tree

    # Find relatives in the tree
    for pos in xrange(len(wd), -1, -1):
        root = wd[:pos]
        if root in name2node:
            target = name2node[root]
            break

    # Add new nodes as necessary
    fullname = root 
    for letter in wd[pos:]:
        fullname += letter 
        new_node = target.add_child(name=letter, dist=1.0)
        name2node[fullname] = new_node

        target = new_node

# Print structure
print tree.get_ascii()
# You can also use all the visualization machinery from ETE
# (http://packages.python.org/ete2/tutorial/tutorial_drawing.html)
# tree.show()

# You can find, isolate and operate with a specific node using the index
wh_node = name2node["whe"]
print wh_node.get_ascii()

# You can rebuild words under a given node
def recontruct_fullname(node):
    name = []
    while node.up:
        name.append(node.name)
        node = node.up
    name = ''.join(reversed(name))
    return name

for leaf in wh_node.iter_leaves():
    print recontruct_fullname(leaf)


                    /n-- /e-- /v-- /e-- /-r
               /e--|
     /w-- /h--|     \r-- /e-- /v-- /e-- /-r
    |         |
    |          \a-- /t-- /e-- /v-- /e-- /-r
    |
    |     /e-- /e-- /-l
----|-r--|
    |    |     /o-- /-t
    |     \o--|
    |          \a-- /-d
    |
    |               /-d
     \c-- /u-- /r--|
                    \-l
jhc
  • 1,671
  • 3
  • 13
  • 16