Optimizing string parsing with Python

Question

I have string in the form 'AB(AB(DDC)C)A(BAAC)DAB(ABC)'.

Each character represents an element (A, B, C or D).
Between parentheses, on the right, there is the child of each element (which may be absent).

In example, having 'AB(AB(DDC)C)A(BAAC)DA', the top level would be AB(AB(DDC)C)A(BAAC)DA --> [A, B, A, D, A] and the corresponding children would be [None, AB(DDC)C, BAAC, None, None]. The children are to be parsed as well recursively.

I have implemented a solution here:

def parse_string(string):

    i = 0                                                                       
    parsed = []                                                                 

    while i < len(string):                                                      
        if string[i] in ('A', 'B', 'C', 'D'):                                        
            parsed.append([string[i], None])                                    
            i += 1                                                              
        elif string[i] == '(':                                                  
            open_brakets = 1                                                    
            i += 1                                                              
            j = i                                                               
            while open_brakets:                                                 
                if string[j] == '(':                                            
                    open_brakets += 1                                           
                elif string[j] == ')':                   
                    open_brakets -= 1                    
                j += 1
            # Parse the children as well
            parsed[-1][-1] = parse_string(string[i:j - 1])       
            i = j                                                               
        else:                                                                   
            i += 1                                                              

    return parsed

print parse_string('AB(AB(DDC)C)A(BAAC)DAB(ABC)')

Although I think it's a bit ugly and I'm sure it is not very efficient.

I wonder if there's a way to make this with Python in a cleaner/faster/more elegant way? Using external libraries is allowed (specially if they're written in C! :-P).

Update

Other examples of strings that should work:

ABC(DAB(ACB)BBB(AAA)ABC)DCB

In general, the length of the string is not limited, neither the number of children, nor their length, nor the number of nested levels.

For future reference, see also: http://stackoverflow.com/questions/1656859/how-can-a-recursive-regexp-be-implemented-in-python — Peque, May 18 '15 at 12:02

Lie Ryan · Accepted Answer · 2015-04-25T02:06:45.997

If you need to recursively parse the inner parentheses as well:

def parse_tree(tree, string, start=0):
    index = start
    while index < len(string):
        current = string[index]
        if current == "(":
            child = tree[-1][1]
            child_parsed = parse_tree(child, string, index+1)
            index += child_parsed + 2 # adds 2 for the parentheses
        elif current == ")":
            break
        else:
            tree.append((current, []))
            index += 1
    return index - start
tree = []
print(parse_tree(tree, 'abc(abc(defg)d)de(f)gh'))

The way this works can be thought of like a state machine. The state machine accepts node definitions until it sees an open parentheses, in which it pushes a new context (i.e. a recursive function call) to the parsing stack to parse the content of the parentheses. When parsing the inner context, the close parentheses pops the context.

Another alternative, which can scale better if you have more complex grammars is to use a parsing library like PyParsing:

from pyparsing import OneOrMore, Optional, oneOf, alphas, Word, Group, Forward, Suppress, Dict

# define the grammar
nodes = Forward()
nodeName = oneOf(list(alphas))
nodeChildren = Suppress('(') + Group(nodes) + Suppress( ')')
node = Group(nodeName + Optional(nodeChildren))
nodes <<= OneOrMore(node)

print(nodes.parseString('abc(abc(defg)d)de(f)gh'))

Parsing libraries like PyParsing allows you to define an easy-to-read declarative grammar.

Answer to original non-recursive parsing: One way to do this is with itertools (accumulate is only from Python 3.2 and up, the itertools docs has a pure python implementation of accumulate for use in older versions). This avoids the use of indices:

from itertools import takewhile, accumulate
PARENS_MAP = {'(': 1, ')': -1}
def parse_tree(tree, string):
    string = iter(string)
    while string:
        current = next(string)
        if current == "(":
            child = iter(string)
            child = ((c, PARENS_MAP.get(c, 0)) for c in child)
            child = accumulate(child, lambda a,b: (b[0], a[1]+b[1]))
            child = takewhile(lambda c: c[1] >= 0, child)
            child = (c[0] for c in child)
            tree[-1][1] = "".join(child)
        else:
            tree.append([current, None])
print(parse_tree('abc(abc(defg)d)de(f)gh'))

I'm not quite sure whether it's faster or more elegant, but I think using explicit indexes is much easier to write, understand, and modify.

+1, nice pyparsing example! Nested expressions are so common, pyparsing also includes a helper method, `nestedExpr`, and the default nesting delimiters are left and right parentheses. With `nestedExpr`, you could write `print nestedExpr().parseString('('+s+')').asList()[0]` - since the original string is not itself a nested expression, we have to wrap it in parens, and then after parsing, take the 0'th element. Gives `['abc', ['abc', ['defg'], 'd'], 'de', ['f'], 'gh']`. But this is only to show what types of builtins come with pyparsing - your original answer was much more illustrative. — PaulMcG, May 04 '15 at 13:13

Mazdak · Answer 2 · 2015-04-22T13:52:55.670

1

You can use regex for parsing your text.

As a more general string consider the following string :

>>> s ='AB(AB(DDC)C)A(BAAC)DAB(ABC)DDD'

You can use re.findall to find the outer pattern :

>>> re.findall(r'(?<=\))\w+(?=\(|$)|^\w+(?=\()',s)
['AB', 'A', 'DAB', 'DDD']

And use that regex with re.split to get the strings bound within parenthesis :

>>> re.split(r'(?<=\))\w+(?=\(|$)|^\w+(?=\()',s)
['', '(AB(DDC)C)', '(BAAC)', '(ABC)', '']

A brife explain about the preceding regex :

This regex is contain of 2 part that concatenates with pip token (|) that works as logic or :

(?<=\))\w+(?=\(|$) :

this regex will match any combination of word characters (\w+) that precede by ) and followed by ( or $ that $ is the end of string modifier that match the end of string.

Note using $ is for the case DDD!

^\w+(?=\() :

this regex will match any combination of word characters that appears at the start of string (modifier ^ will match the start of string) and followed by (

edited Apr 22 '15 at 13:52

answered Apr 22 '15 at 13:45

Mazdak

105,000
18
159
188

Thanks for the tip. I'll check if that ends up being faster! :-) It is definitely more elegant, though. – Peque Apr 22 '15 at 13:48
@Peque Welcome! and if find this answer helpful you can tell this to community by voting or accepting the answer ;) – Mazdak Apr 22 '15 at 13:57
I will wait before accepting. Need to try it and also in case someone else wants to post their own answer. ;-) But don't worry, won't take longer than 24-48 hours! :-P (you've got my upvote already) – Peque Apr 22 '15 at 14:10
@Peque yeas its the way ;) – Mazdak Apr 22 '15 at 14:11
This fails to parse this input "ABC(GHI(MNO)PQR(STU)JKL)DEF". Unfortunately, I think regex cannot be used to parse this kind of inputs; a language that may contain arbitrary nesting of parentheses does not have a regular grammar, so it cannot be parsed with a regular expression. – Lie Ryan Apr 22 '15 at 15:50
@LieRyan: true, it fails with that case. – Peque Apr 23 '15 at 07:23
@LieRyan Yes, im agree that *a language that may contain arbitrary nesting of parentheses does not have a regular grammar* but i wrote this answer for this case, as OP doesn't says about alternative cases! – Mazdak Apr 23 '15 at 07:27

Mike Dunlavey · Answer 3 · 2015-04-22T13:41:20.977

As far as being a bit ugly, that's in the eye of the beholder.

As far as speed is concerned, it will be hard to improve on your code.

ADDED: Here is how I would do it in C++. You can adapt to Python if you care. This shows how to do it with recursion. The top-level function is topLevel("blah blah").

bool seeLetter(char* &s){
    if (*s=='A' || *s=='B' || *s=='C' || *s=='D'){
         s++;
         return true;
    } else {
         return false;
    }
}

bool seeChar(char* &s, char c){
    if (*s == c){s++; return true;}
    return false;
}

bool seeList(char* &s){
    while(*s){
        if (seeLetter(s)){
        } else if (seeChar(s, '(')){
            if (!seeList(s)) return false;
            if (!seeChar(s, ')')) return false;
        } else break;
    }
    return true;
}

bool topLevel(char* &s){
    if (!seeList(s)) return false;
    return (*s == '\0'); // check for garbage at end
}

Thanks for taking the time to write this answer. :-) Yeah, iterating over a string in C (using pointers) is much more elegant than in Python, but as stated in the question, what I'm looking for is a better or more efficient implementation in Python, not C/C++. ;-) — Peque, Apr 22 '15 at 13:47

Optimizing string parsing with Python

Update

3 Answers3

Linked