9

I need help with this developing this algorithm that I'm working on. I have a an input of a tree in the following format:

( Root ( AB ( ABC ) ( CBA ) ) ( CD ( CDE ) ( FGH ) ) )

This looks this the following tree.

                   Root
                     |
                ____________
              AB           CD
              |             |  
       __________         ___________
      ABC      CBA        CDE      FGH

What the algorithm is suppose to is read the parenthetical format in and give the following output:

Root -> AB CD
AB -> ABC CBA
CD -> CDE FGH

It list the root and its children and all other parents that have children. I am not able to understand how to start up on this, Can someone help me gimme hint or give some references or links?

Hell Man
  • 833
  • 3
  • 18
  • 24

4 Answers4

4

Solution: the Tree class from module nltk

(aka Natural Language Toolkit)

Making the actual parsing

This is your input:

input = '( Root ( AB ( ABC ) ( CBA ) ) ( CD ( CDE ) ( FGH ) ) )'

And you parse it very simply by doing:

from nltk import Tree
t = Tree.fromstring(input)

Playing with the parsed tree

>>> t.label()
'Root'
>>> len(t)
2
>>> t[0]
Tree('AB', [Tree('ABC', []), Tree('CBA', [])])
>>> t[1]
Tree('CD', [Tree('CDE', []), Tree('FGH', [])])
>>> t[0][0]
Tree('ABC', [])
>>> t[0][1]
Tree('CBA', [])
>>> t[1][0]
Tree('CDE', [])
>>> t[1][1]
Tree('FGH', [])

As you seen, you can treat each node as a list of subtrees.

To pretty-print the tree

>>> t.pretty_print()
            Root            
      _______|________       
     AB               CD    
  ___|___          ___|___   
ABC     CBA      CDE     FGH
 |       |        |       |  
...     ...      ...     ...

To obtain the output you want

from sys import stdout

def showtree(t):
    if (len(t) == 0):
        return
    stdout.write(t.label() + ' ->')
    for c in t:
        stdout.write(' ' + c.label())
    stdout.write('\n')
    for c in t:
        showtree(c)

Usage:

>>> showtree(t)
Root -> AB CD
AB -> ABC CBA
CD -> CDE FGH

To install the module

pip install nltk

(Use sudo if required)

Kubuntuer82
  • 1,487
  • 3
  • 18
  • 40
3

A recursive descent parser is a simple form of parser that can parse many grammars. While the entire theory of parsing is too large for a stack-overflow answer, the most common approach to parsing involves two steps: first, tokenisation, which extracts subwords of your string (here, probably words like 'Root', and 'ABC', or brackets like '(' and ')'), and then parsing using recursive functions.

This code parses input (like your example), producing a so-called parse tree, and also has a function 'show_children' which takes the parse tree, and produces the children view of the expression as your question asked.

import re

class ParseError(Exception):
    pass

# Tokenize a string.
# Tokens yielded are of the form (type, string)
# Possible values for 'type' are '(', ')' and 'WORD'
def tokenize(s):
    toks = re.compile(' +|[A-Za-z]+|[()]')
    for match in toks.finditer(s):
        s = match.group(0)
        if s[0] == ' ':
            continue
        if s[0] in '()':
            yield (s, s)
        else:
            yield ('WORD', s)


# Parse once we're inside an opening bracket.
def parse_inner(toks):
    ty, name = next(toks)
    if ty != 'WORD': raise ParseError
    children = []
    while True:
        ty, s = next(toks)
        if ty == '(':
            children.append(parse_inner(toks))
        elif ty == ')':
            return (name, children)

# Parse this grammar:
# ROOT ::= '(' INNER
# INNER ::= WORD ROOT* ')'
# WORD ::= [A-Za-z]+
def parse_root(toks):
    ty, _ = next(toks)
    if ty != '(': raise ParseError
    return parse_inner(toks)

def show_children(tree):
    name, children = tree
    if not children: return
    print '%s -> %s' % (name, ' '.join(child[0] for child in children))
    for child in children:
        show_children(child)

example = '( Root ( AB ( ABC ) ( CBA ) ) ( CD ( CDE ) ( FGH ) ) )'
show_children(parse_root(tokenize(example)))
Paul Hankin
  • 54,811
  • 11
  • 92
  • 118
  • 2
    One of the flaws of Python (at least CPython) is a relatively shallow maximum stack depth. This is nice and clean code, but it relies on the Python stack, so it will be limited in how deep of a tree it can parse. Other solutions, such as the one I posted, can handle very deep trees. If this solved the problem I'd leave it as-is because I like the simplicity, but this could be rewritten to use a `list` as a stack to keep the state, and then it would be able to parse very deep trees as well. – steveha Nov 05 '13 at 06:29
1

Try this :

def toTree(expression):
    tree = dict()
    msg =""
    stack = list()
    for char in expression:
        if(char == '('):
            stack.append(msg)
            msg = ""
        elif char == ')':
            parent = stack.pop()
            if parent not in tree:
                tree[parent] = list()
            tree[parent].append(msg)
            msg = parent
        else:
            msg += char
    return tree

expression = "(Root(AB(ABC)(CBA))(CD(CDE)(FGH)))"
print toTree(expression)

It returns a dictionary, where the root can be accessed with the key ''. You can then do a simple BFS to print the output.

OUTPUT :

{
''    : ['Root'], 
'AB'  : ['ABC', 'CBA'], 
'Root': ['AB', 'CD'], 
'CD'  : ['CDE', 'FGH']
}

You will have to eliminate all the whitespaces in the Expression before you start, or ignore the inrrelevant charaters in the expression by adding the following as the very first line in the for-loop :

if char == <IRRELEVANT CHARACTER>:
    continue

The above code will run in O(n) time, where n is the length of the expression.

EDIT

Here is the printing function :

def printTree(tree, node):
    if node not in tree:
        return 
    print '%s -> %s' % (node, ' '.join(child for child in tree[node])) 
    for child in tree[node]:
        printTree(tree, child) 

The desired Output can be achieved by the following :

expression = "(Root(AB(ABC)(CBA))(CD(CDE)(FGH)))"
tree = toTree(expression)
printTree(tree, tree[''][0])

Output

Root -> AB CD
AB -> ABC CBA
CD -> CDE FGH

EDIT

Assuming the node names are not unique, we just have to give new names to the nodes. This can be done using :

def parseExpression(expression):
    nodeMap = dict()
    counter = 1
    node = ""
    retExp =""
    for char in expression:
        if char == '(' or char == ')' :
            if (len(node) > 0):
                nodeMap[str(counter)] = node;
                retExp += str(counter)
                counter +=1
            retExp += char
            node =""
        elif char == ' ': continue
        else :
            node += char
    return retExp,nodeMap

The print Function will now change to :

def printTree(tree, node, nodeMap):
    if node not in tree:
        return 
    print '%s -> %s' % (nodeMap[node], ' '.join(nodeMap[child] for child in tree[node])) 
    for child in tree[node]:
        printTree(tree, child, nodeMap)

The output can be obtained by using :

expression = " ( Root( SQ ( VBZ ) ( NP ( DT ) ( NN ) ) ( VP ( VB ) ( NP ( NN ) ) ) ))"
expression, nodeMap = parseExpression(expression)
tree = toTree(expression)
printTree(tree, tree[''][0], nodeMap)

Output :

Root -> SQ
SQ -> VBZ NP VP
NP -> DT NN
VP -> VB NP
NP -> NN
Kyuubi
  • 1,228
  • 3
  • 18
  • 32
  • Thats forming an undirected graph not a tree. There are multiple paths to reach the node NN and NP – Kyuubi Nov 04 '13 at 12:22
  • 1
    @HellMan :Made the appropriate changes to consider dublicate node names. The algorithm still runs with O(n) time complexity. – Kyuubi Nov 04 '13 at 13:07
  • Hey Kyuubi, I'm getting the following error: TypeError: tuple indices must be integers, not str. – Hell Man Nov 09 '13 at 00:10
0

I think the most popular solution for parsing in Python is PyParsing. PyParsing comes with a grammar for parsing S-expressions and you should be able to just use it. Discussed in this StackOverflow answer:

Parsing S-Expressions in Python

Community
  • 1
  • 1
steveha
  • 74,789
  • 21
  • 92
  • 117