1

I have a a txt file representing a tree that is defined in the following BNF like format:

branches : '(' <value1> <value2> <n_children> <branches> ')'
         | ''

Each node for the tree has two values value1 and value2 that are integers, number of children that is an integer, and branches. an example of a subset of the file looks like:

(1796161 2205411 3  
    (1796288 2205425 0  )
    (1811141 2205419 1  
        (1811652 2205480 1  
            (1812161 2205496 4  
                (1812288 2205521 1  
                    (1812415 2205526 0  ))
                (1812034 2205516 0  )
                (1827651 2205510 2  
                    (1827906 2205581 2  
                        (1843777 2205588 2  
                            (1844032 2205626 1  
                                (1844159 2205632 0  ))
                            (1843138 2205617 0  ))
                        (1828288 2205591 1  
                            (1828161 2205597 0  )))
                    (1827012 2205563 0  ))
                (1811907 2205511 0  ))))
    (1796034 2205420 0  ))

Is there a nice way to parse this data using regular expression(s) that wouldn't involve me manually reading the file character by character keeping track of all the parantheses and preserves the relationship (parent-child) ordering.

Sid Shukla
  • 990
  • 1
  • 8
  • 33
  • The parentheses don't really matter here, do they? The format is equally valid without indentation or parentheses because it tells you how many children there are for the current node. So, when you read each line, you know how many children to read for that node. Once you finish, you go back to the parent node and continue reading for *that* node, and so on and so forth. – Rushy Panchal Dec 28 '16 at 16:54
  • @RushyPanchal The parentheses demarcate the end of a node. You're right that the number of children is already given and you know how many children to parse but I still need to keep track of number of opened and closed parentheses to keep track of where I am in the tree. – Sid Shukla Dec 28 '16 at 17:00
  • You don't need to keep track of the parentheses if you use a stack to keep track of where you are in the tree. Every time you descend down the tree (i.e. number of children > 0), push the current node onto the stack. When you finish reading the children, pop the parent and continue reading its children. It requires a bit more bookkeeping than just a stack (you need to keep track of how many children are left to read at each level), but it doesn't require you to know how the parentheses are nested. – Rushy Panchal Dec 28 '16 at 17:12
  • This recursive grammar could probably be converted to a [recursive regex in Python](https://stackoverflow.com/questions/26385984/recursive-pattern-in-regex). – Anderson Green Nov 26 '22 at 22:53

1 Answers1

3

This is easiest to do with a parser that lets you write your own grammar, like pyPEG. The following creates a tree of Node objects, where each Node can have zero or more children:

import re

from pypeg2 import attr, ignore, List, maybe_some, parse

Int = re.compile(r'[0-9]+')  # a positive integer

class Values(List):
    '''A pair of values associated with each node in the tree.

    For example, in the node

        ( 1 2 0 )

    the values are 1 and 2 (and 0 is the number of children).
    '''

    grammar = 2, Int

class Node(List):
    '''A node in the tree.

    Attributes:
        values      The pair of values associated with this node
        children    A list of child nodes
    '''

    def __repr__(self):
        return 'Values: ' + ', '.join(self.values)

# Grammar for Node is recursive (a Node can contain other Nodes),
# so we have to set it outside of the Node class definition.
Node.grammar = (
    '(',
    attr('values', Values),
    ignore(Int),
    attr('children', maybe_some(Node)),
    ')'
)

def print_tree(root, indent=0):
    '''Print a tree of Nodes recursively'''

    print(' ' * indent, end='')
    print(root)

    if len(root.children) > 0:
        for child in root.children:
            print_tree(child, indent + 2)


if __name__ == '__main__':

    tree = '''
        (1796161 2205411 3  
            (1796288 2205425 0  )
            (1811141 2205419 1  
                (1811652 2205480 1  
                    (1812161 2205496 4  
                        (1812288 2205521 1  
                            (1812415 2205526 0  ))
                        (1812034 2205516 0  )
                        (1827651 2205510 2  
                            (1827906 2205581 2  
                                (1843777 2205588 2  
                                    (1844032 2205626 1  
                                        (1844159 2205632 0  ))
                                    (1843138 2205617 0  ))
                                (1828288 2205591 1  
                                    (1828161 2205597 0  )))
                            (1827012 2205563 0  ))
                        (1811907 2205511 0  ))))
            (1796034 2205420 0  ))
    '''

    root = parse(tree, Node)
    print_tree(root)

Output:

Values: 1796161, 2205411
  Values: 1796288, 2205425
  Values: 1811141, 2205419
    Values: 1811652, 2205480
      Values: 1812161, 2205496
        Values: 1812288, 2205521
          Values: 1812415, 2205526
        Values: 1812034, 2205516
        Values: 1827651, 2205510
          Values: 1827906, 2205581
            Values: 1843777, 2205588
              Values: 1844032, 2205626
                Values: 1844159, 2205632
              Values: 1843138, 2205617
            Values: 1828288, 2205591
              Values: 1828161, 2205597
          Values: 1827012, 2205563
        Values: 1811907, 2205511
  Values: 1796034, 2205420
ThisSuitIsBlackNot
  • 23,492
  • 9
  • 63
  • 110