Parsing a string for nested patterns

Question

What would be the best way to do this.

The input string is

<133_3><135_3><116_2>The other system worked for about 1 month</116_2> got some good images <137_3>on it then it started doing the same thing as the first one</137_3> so then I quit using either camera now they are just sitting and collecting dust.</135_3></133_3>

the expected output is

{'The other system worked for about 1 month got some good images on it then it started doing the same thing as the first one so then I quit \
using either camera now they are just sitting and collecting dust.':[133, 135],

'The other system worked for about 1 month': [116],

'on it then it started doing the same thing as the first one':[137]

}

that seems like a recursive regexp search but I can't figure out how exactly.

I can think of a tedious recursive function as of now, but have a feeling that there should be a better way.

Related question: Can regular expressions be used to match nested patterns?

score 4 · Answer 1 · answered Dec 01 '08 at 10:15

Take an XML parser, make it generate a DOM (Document Object Model) and then build a recursive algorithm that traverses all the nodes, calls "text()" in each node (that should give you the text in the current node and all children) and puts that as a key in the dictionary.

Daniel Naab · Accepted Answer · 2008-12-01T10:37:13.950

Use expat or another XML parser; it's more explicit than anything else, considering you're dealing with XML data anyway.

However, note that XML element names can't start with a number as your example has them.

Here's a parser that will do what you need, although you'll need to tweak it to combine duplicate elements into one dict key:

from xml.parsers.expat import ParserCreate

open_elements = {}
result_dict = {}

def start_element(name, attrs):
    open_elements[name] = True

def end_element(name):
    del open_elements[name]

def char_data(data):
    for element in open_elements:
        cur = result_dict.setdefault(element, '')
        result_dict[element] = cur + data

if __name__ == '__main__':
    p = ParserCreate()

    p.StartElementHandler = start_element
    p.EndElementHandler = end_element
    p.CharacterDataHandler = char_data

    p.Parse(u'<_133_3><_135_3><_116_2>The other system worked for about 1 month</_116_2> got some good images <_137_3>on it then it started doing the same thing as the first one</_137_3> so then I quit using either camera now they are just sitting and collecting dust.</_135_3></_133_3>', 1)

    print result_dict

jfs · Answer 3 · 2008-12-01T14:20:22.000

2

from cStringIO   import StringIO
from collections import defaultdict
####from xml.etree   import cElementTree as etree
from lxml import etree

xml = "<e133_3><e135_3><e116_2>The other system worked for about 1 month</e116_2> got some good images <e137_3>on it then it started doing the same thing as the first one</e137_3> so then I quit using either camera now they are just sitting and collecting dust. </e135_3></e133_3>"

d = defaultdict(list)
for event, elem in etree.iterparse(StringIO(xml)):
    d[''.join(elem.itertext())].append(int(elem.tag[1:-2]))

print(dict(d.items()))

Output:

{'on it then it started doing the same thing as the first one': [137], 
'The other system worked for about 1 month': [116], 
'The other system worked for about 1 month got some good images on it then it started doing the same thing as the first one so then I quit using \
either camera now they are just sitting and collecting dust. ': [133, 135]}

edited Dec 01 '08 at 14:20

answered Dec 01 '08 at 11:40

jfs

399,953
195
994
1,670

Sebastian Did you test it? For me it fails with the following traceback Traceback (most recent call last): File "recurse_regex.py", line 55, in d[''.join(elem.itertext())].append(int(elem.tag[1:-2])) AttributeError: 'etree._Element' object has no attribute 'itertext' – JV. Dec 02 '08 at 05:07
@JV: Yes, I've run it. 'xml.etree.cElementTree has no 'itertext', but 'lxml.etree' has. 'lxml' is not a part of Python's standard library, but 'cElementTree' is. I don't know what is better for OP: to install lxml 'easy_install lxml' or to write 'itertext' for cElementTree. – jfs Dec 02 '08 at 14:38

score 1 · Answer 4 · answered Dec 01 '08 at 09:20

1

I think a grammar would be the best option here. I found a link with some information: http://www.onlamp.com/pub/a/python/2006/01/26/pyparsing.html

answered Dec 01 '08 at 09:20

Gonzalo Quero

3,303
18
23

score 1 · Answer 5 · answered Dec 01 '08 at 10:45

Note that you can't actually solve this by a regular expression, since they don't have the expressive power to enforce proper nesting.

Take the following mini-language:

A certain number of "(" followed by the same number of ")", no matter what the number.

You could make a regular expression very easily to represent a super-language of this mini-language (where you don't enforce the equality of the number of starts parentheses and end parentheses). You could also make a regular expression very easilty to represent any finite sub-language (where you limit yourself to some max depth of nesting). But you can never represent this exact language in a regular expression.

So you'd have to use a grammar, yes.

Regular expressions in real languages might be more powerful than regular expressions. — jfs, Dec 01 '08 at 15:40

score 0 · Answer 6 · answered Dec 01 '08 at 14:02

Here's an unreliable inefficient recursive regexp solution:

import re

re_tag = re.compile(r'<(?P<tag>[^>]+)>(?P<content>.*?)</(?P=tag)>', re.S)

def iterparse(text, tag=None):
    if tag is not None: yield tag, text
    for m in re_tag.finditer(text):
        for tag, text in iterparse(m.group('content'), m.group('tag')):
            yield tag, text

def strip_tags(content):
    nested = lambda m: re_tag.sub(nested, m.group('content'))
    return re_tag.sub(nested, content)


txt = "<133_3><135_3><116_2>The other system worked for about 1 month</116_2> got some good images <137_3>on it then it started doing the same thing as the first one</137_3> so then I quit using either camera now they are just sitting and collecting dust. </135_3></133_3>"
d = {}
for tag, text in iterparse(txt):
    d.setdefault(strip_tags(text), []).append(int(tag[:-2]))

print(d)

Output:

{'on it then it started doing the same thing as the first one': [137], 
 'The other system worked for about 1 month': [116], 
 'The other system worked for about 1 month got some good images on it then it started doing the same thing as the first one so then I quit using \
 either camera now they are just sitting and collecting dust. ': [133, 135]}

Parsing a string for nested patterns

6 Answers6

Linked