Python regex for nested XML elements

Question

I have a text (not a properly formed XML document) with some words in XML tags like this:

We have Potter the <term attrib="LINE:246">wizard</term> interacting with<term attrib="LINE:36080">witches</term> and <term attrib="LINE:360">goblins</term>  talking about <term attrib="LINE:337"><term attrib="LINE:329"><term attrib="LINE:468">dark</term></term> <term attrib="LINE:375">arts</term></term> in regions to the east of Hogwarts.

I need to extract the terms in the XML tags. My problem is that I do not know what regex I should use to get a nested element like this:

<term><term>something</term><term>else</term></term>

I am using Python, and I have tried the following:

re.findall(r'(<term.+?</term>)', textfile)

But I get something like this:

<term><term>something</term>

This is bad, because I am missing the rest. I also tried the following greedy version (which is worse):

re.findall(r'(<term.+</term>)' , textfile)

Can you please help me?

You might find http://stackoverflow.com/questions/37113364/regex-for-nested-xml-attributes informative on the problems associated with attempting to parse nested XML with regex ... — Zero Piraeus, May 30 '16 at 15:33

oligofren · Accepted Answer · 2016-05-30T15:30:52.310

1

You are using the wrong tool for the job. Regular expressions can't (normally) count, so they cannot keep track of how deeply tags are nested, and using them for stuff like this will be extremely fragile. Use a proper XML parser with a nice front-end, like BeautifulSoup. It will save you time and get you better results that are less hackish than a regex ever will be.

See the great docs for examples
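To make the parser suggestion concrete, here is a minimal sketch. It swaps in the standard library's xml.etree.ElementTree rather than BeautifulSoup, so that it runs without extra installs. It assumes the text contains no stray < or & characters outside the tags (otherwise parsing raises an error), and the <root> wrapper is a hypothetical name added only so the fragment parses as a single document.

```python
import xml.etree.ElementTree as ET

text = (
    'We have Potter the <term attrib="LINE:246">wizard</term> interacting with'
    '<term attrib="LINE:36080">witches</term> and <term attrib="LINE:360">goblins</term>'
    '  talking about <term attrib="LINE:337"><term attrib="LINE:329">'
    '<term attrib="LINE:468">dark</term></term> <term attrib="LINE:375">arts</term></term>'
    ' in regions to the east of Hogwarts.'
)

# Wrap the fragment in a made-up <root> element (an assumption of this sketch)
# so that it becomes one well-formed XML document.
root = ET.fromstring("<root>" + text + "</root>")

# iter("term") visits every <term>, nested ones included, in document order.
# itertext() yields the text of an element and of all its descendants.
terms = ["".join(el.itertext()) for el in root.iter("term")]
print(terms)
# ['wizard', 'witches', 'goblins', 'dark arts', 'dark', 'dark', 'arts']
```

The nested element comes out once as the full phrase and once for each inner term, so you can keep whichever level you need.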

edited May 30 '16 at 15:30

answered May 30 '16 at 15:25


score -1 · Answer 2 · answered May 30 '16 at 15:30

Maybe try:

import re

text = 'We have Potter the <term attrib="LINE:246">wizard</term> interacting with<term attrib="LINE:36080">witches</term> and <term attrib="LINE:360">goblins</term>  talking about <term attrib="LINE:337"><term attrib="LINE:329"><term attrib="LINE:468">dark</term></term> <term attrib="LINE:375">arts</term></term> in regions to the east of Hogwarts.'
# Remove anything that looks like a tag: "<", the shortest possible run of characters, then ">".
text = re.sub("<.+?>", '', text)
# Collapse the double space that is already present in the sample text.
text = re.sub("  ", " ", text)
print(text)

This should cut out every <tag> and </tag> there is, leaving everything else intact.

Of course, it will be messy if there are any < signs that aren't part of an XML tag.
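If you need the nested elements themselves rather than the text with tags stripped, one possible workaround is to let the regex find only the individual tags and do the counting by hand. The sketch below is an illustration under assumptions, not a robust parser: outermost_terms is a made-up helper name, and it assumes there are no self-closing <term/> tags and no > characters inside attribute values.

```python
import re

def outermost_terms(text):
    """Return each top-level <term>...</term> span, nested tags included."""
    spans = []
    depth = 0
    start = None
    # The regex only locates single opening or closing tags;
    # the depth counter is what keeps track of nesting.
    for m in re.finditer(r"<term\b[^>]*>|</term>", text):
        if m.group().startswith("</"):
            if depth == 0:
                continue  # stray closing tag with no opener: skip it
            depth -= 1
            if depth == 0:
                spans.append(text[start:m.end()])
        else:
            if depth == 0:
                start = m.start()  # remember where the outermost element begins
            depth += 1
    return spans

sample = "a <term><term>something</term><term>else</term></term> b <term>x</term>"
print(outermost_terms(sample))
# ['<term><term>something</term><term>else</term></term>', '<term>x</term>']
```

An element that is opened but never closed is silently dropped here, which is one more reason a real parser is the safer choice.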
