Python file parsing: Build tree from text file

Question

I have an indented text file that will be used to build a tree. Each line represents a node, and indents represent depth as well as node the current node is a child of.

For example, a file might look like

ROOT
   Node1
      Node2
         Node3
            Node4
   Node5
   Node6

Which indicates that ROOT contains three children: 1, 5, and 6, Node1 has one child: 2, and Node2 has one child: 3, etc.

I have come up with a recursive algorithm and have programmed it and it works, but it's kind of ugly and especially treats the example above very crudely (when going from node 4 to node 5)

It uses "indent count" as the basis for recursion, so if the number of indents = current depth + 1, I would go one level deeper. But this means when I read a line with less indents, I have to come back up one level at a time, checking the depth each time.

Here is what I have

def _recurse_tree(node, parent, depth):
    tabs = 0
    
    while node:
        tabs = node.count("\t")
        if tabs == depth:
            print "%s: %s" %(parent.strip(), node.strip())
        elif tabs == depth + 1:
            node = _recurse_tree(node, prev, depth+1)
            tabs = node.count("\t")
            
            #check if we have to surface some more
            if tabs == depth:
                print "%s: %s" %(parent.strip(), node.strip())
            else:
                return node
        else:
            return node
        
        prev = node
        node = inFile.readline().rstrip()
        
inFile = open("test.txt")
root = inFile.readline().rstrip()
node = inFile.readline().rstrip()
_recurse_tree(node, root, 1)

Right now I am just printing out the nodes to verify that the parent node is correct for each line, but maybe there is a cleaner way to do it? Especially the case in the elif block when I'm coming back from each recursion call.

You will have to write a bunch of code to make the right parsing of this and the validation. Can you use xml? Its base structure is a tree. — guax, May 20 '11 at 18:15
Unfortunately no, as this is more of a recursion exercise than anything. I assumed this sort of problem would be quite common though. — MxLDevs, May 20 '11 at 18:20
Might this be a homework question? If so, it's etiquette to add the homework tag. — Mu Mind, May 20 '11 at 18:40
If so, this is really not Python-specific. More of a general algorithm musing. — Santa, May 20 '11 at 19:06

pillmuncher · Answer 1 · 2011-06-03T01:22:33.380

20

If you don't insist on recursion, this works too:

from itertools import takewhile

is_tab = '\t'.__eq__

def build_tree(lines):
    lines = iter(lines)
    stack = []
    for line in lines:
        indent = len(list(takewhile(is_tab, line)))
        stack[indent:] = [line.lstrip()]
        print stack

source = '''ROOT
\tNode1
\t\tNode2
\t\t\tNode3
\t\t\t\tNode4
\tNode5
\tNode6'''

build_tree(source.split('\n'))

Result:

['ROOT']
['ROOT', 'Node1']
['ROOT', 'Node1', 'Node2']
['ROOT', 'Node1', 'Node2', 'Node3']
['ROOT', 'Node1', 'Node2', 'Node3', 'Node4']
['ROOT', 'Node5']
['ROOT', 'Node6']

edited Jun 03 '11 at 01:22

answered May 22 '11 at 16:37

pillmuncher

10,094
2
35
33

2

Great style. Especially `is_tab` definition. – Bunyk Jul 31 '14 at 12:38
1

Without `takewhile` (faster and--I think--cleaner): `for line in lines: body = line.lstrip('\t'); level = len(line) - len(body); stack[level:] = (body,)`. – Kirill Bulygin May 22 '17 at 16:56
This is so clean. – aheze Jul 10 '21 at 15:35

Mu Mind · Accepted Answer · 2011-05-22T14:19:17.340

13

The big issue is the "lookahead" that I think caused the ugliness in question. It can be shortened slightly:

def _recurse_tree(parent, depth, source):
    last_line = source.readline().rstrip()
    while last_line:
        tabs = last_line.count('\t')
        if tabs < depth:
            break
        node = last_line.strip()
        if tabs >= depth:
            if parent is not None:
                print "%s: %s" %(parent, node)
            last_line = _recurse_tree(node, tabs+1, source)
    return last_line

inFile = open("test.txt")
_recurse_tree(None, 0, inFile)

Since we're talking recursion, I took pains to avoid any global variables (source and last_line). It would be more pythonic to make them members on some parser object.

edited May 22 '11 at 14:19

answered May 20 '11 at 19:38

Mu Mind

10,935
4
38
69

@martineau: you're right, I meant to replace `inFile` with `source` inside the function, fixed that now. – Mu Mind May 20 '11 at 21:00
Also looks to me like the `last_line` argument is always passed in as `None` -- so it could be just be a local variable with an initial value of `source.readline().rstrip()` set before the `while` loop (and the check for it being `None` removed). – martineau May 22 '11 at 07:04
@martineau: right again, edited accordingly. When I was writing it, I was tinkering a bit and wasn't sure whether each recursion/return would correspond to moving to the next line of input. Since I mentioned this being a "shortened" version, I guess I'd better squeeze all the air out, huh? – Mu Mind May 22 '11 at 14:31

score 11 · Answer 3 · edited Mar 24 '20 at 19:48

I would not use recursion for something like this at all (Ok, maybe I would if I was coding this in a language like Scheme, but this is Python here). Recursion is great for iterating over data that is shaped like a tree, and in those cases it would simplify your design greatly when compared to normal loops.

However, this is not the case here. Your data surely represents a tree, but it's formatted sequentially, i.e. it is a simple sequence of lines. Such data is most easily processed with a simple loop, although you could make the design more general, if you wish, by separating it into three different layers: the sequential reader (which will parse the tabs as a specification of depth level), the tree inserter (which inserts a node into a tree in a specific depth level, by keeping track of the last node which was inserted into the tree) and the tree itself:

import re

# *** Tree representation ***
class Node(object):
    def __init__(self, title):
        self.title = title
        self.parent = None
        self.children = []

    def add(self, child):
        self.children.append(child)
        child.parent = self

# *** Node insertion logic ***
class Inserter(object):
    def __init__(self, node, depth = 0):
        self.node = node
        self.depth = depth

    def __call__(self, title, depth):
        newNode = Node(title)
        if (depth > self.depth):
            self.node.add(newNode)
            self.depth = depth
        elif (depth == self.depth):
            self.node.parent.add(newNode)
        else:
            parent = self.node.parent
            for i in xrange(0, self.depth - depth):
                parent = parent.parent
            parent.add(newNode)
            self.depth = depth

        self.node = newNode

# *** File iteration logic ***
with open(r'tree.txt', 'r') as f:    
    tree = Node(f.readline().rstrip('\n'))
    inserter = Inserter(tree)

    for line in f:
        line = line.rstrip('\n')
        # note there's a bug with your original tab parsing code:
        # it would count all tabs in the string, not just the ones
        # at the beginning
        tabs = re.match('\t*', line).group(0).count('\t')
        title = line[tabs:]
        inserter(title, tabs)

When I had to test this code before pasting it here, I wrote a very simple function to pretty print the tree that I read to memory. For this function, the most natural thing was to use recursion of course, because now the tree is indeed represented as tree data:

def print_tree(node, depth = 0):
    print '%s%s' % ('  ' * depth, node.title)
    for child in node.children:
        print_tree(child, depth + 1)

print_tree(tree)

Python file parsing: Build tree from text file

3 Answers3

Linked