I have a text (not a properly formed XML document) with some words in XML tags like this:

We have Potter the <term attrib="LINE:246">wizard</term> interacting with<term attrib="LINE:36080">witches</term> and <term attrib="LINE:360">goblins</term>  talking about <term attrib="LINE:337"><term attrib="LINE:329"><term attrib="LINE:468">dark</term></term> <term attrib="LINE:375">arts</term></term> in regions to the east of Hogwarts.

I need to extract the terms in the XML tags. My problem is that I do not know what regex I should use to get a nested element like this:

<term><term>something</term><term>else</term></term>

I am using Python, and I have tried the following:

re.findall(r'(<term.+?</term>)', textfile)

But I get something like this:

<term><term>something</term>

This is bad, because I am missing the rest of the nested element. I also tried the following greedy version (which is worse):

re.findall(r'(<term.+</term>)', textfile)

Can you please help me?

E_Munch

2 Answers

You are using the wrong tool for the job. Regular expressions can't (normally) count, so using them to match nested tags like this will be extremely fragile. Use a proper XML parser with a nice front end, like BeautifulSoup. It will save you time and get better, less hackish results than a regex ever will.

See the BeautifulSoup docs for examples.
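For illustration only, here is a minimal sketch of the parser-based idea. It swaps in the standard library's html.parser instead of BeautifulSoup so that it runs without extra installs; the names TermExtractor and extract_terms are made up for this example, and it assumes every term tag in the text is properly closed.

```python
from html.parser import HTMLParser


class TermExtractor(HTMLParser):
    """Collect the text inside every <term> element, nested ones included."""

    def __init__(self):
        super().__init__()
        self.open_terms = []  # one text buffer per currently open <term>
        self.terms = []       # finished terms, in the order their tags close

    def handle_starttag(self, tag, attrs):
        if tag == "term":
            self.open_terms.append([])

    def handle_data(self, data):
        # Text belongs to every <term> that is open at this point.
        for buffer in self.open_terms:
            buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "term" and self.open_terms:
            self.terms.append("".join(self.open_terms.pop()))


def extract_terms(text):
    """Hypothetical helper: return the text of every <term>, innermost first."""
    parser = TermExtractor()
    parser.feed(text)
    parser.close()
    return parser.terms


sample = ('talking about <term attrib="LINE:337"><term attrib="LINE:329">'
          '<term attrib="LINE:468">dark</term></term> '
          '<term attrib="LINE:375">arts</term></term> in regions')
print(extract_terms(sample))
```

Because the parser tracks open tags on a stack, the outer element comes back whole ("dark arts") alongside the inner ones, which is the part the non-greedy regex loses.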

oligofren

Maybe try:

import re

text = 'We have Potter the <term attrib="LINE:246">wizard</term> interacting with<term attrib="LINE:36080">witches</term> and <term attrib="LINE:360">goblins</term>  talking about <term attrib="LINE:337"><term attrib="LINE:329"><term attrib="LINE:468">dark</term></term> <term attrib="LINE:375">arts</term></term> in regions to the east of Hogwarts.'
# Remove everything that looks like a tag: "<", then as little as possible, then ">"
text = re.sub(r"<.+?>", "", text)
# Collapse double spaces into single spaces
text = re.sub("  ", " ", text)
print(text)

This should cut out every <tag> and </tag> there is, leaving everything else intact.

Of course, it will be messy if there are any < signs that aren't part of an XML tag.
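One way to reduce that risk, sketched here under the assumption that term is the only tag you need to remove, is to match only term tags instead of anything between angle brackets. The names TERM_TAG and strip_term_tags are made up for this example.

```python
import re

# Matches <term ...> and </term> only; other "<" characters are left alone.
TERM_TAG = re.compile(r"</?term\b[^>]*>")


def strip_term_tags(text):
    """Hypothetical helper: delete term tags and keep all other text."""
    return TERM_TAG.sub("", text)


print(strip_term_tags('if a < b then <term attrib="LINE:1">cast</term> the spell'))
```

With the broader "<.+?>" pattern, the match would start at the stray < and swallow " b then " along with the opening tag.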

Maciek