I have a text (not a properly formed XML document) with some words in XML tags like this:
We have Potter the <term attrib="LINE:246">wizard</term> interacting with<term attrib="LINE:36080">witches</term> and <term attrib="LINE:360">goblins</term> talking about <term attrib="LINE:337"><term attrib="LINE:329"><term attrib="LINE:468">dark</term></term> <term attrib="LINE:375">arts</term></term> in regions to the east of Hogwarts.
I need to extract the terms in the XML tags. My problem is that I do not know what regex I should use to get a nested element like this:
<term><term>something</term><term>else</term></term>
I am using Python, and I have tried the following:
re.findall(r'(<term.+?</term>)', textfile)
But I get something like this:
<term><term>something</term>
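This is bad because the match stops at the first closing tag, so I am missing the rest of the nested element. I also tried the following greedy version (which is worse, because it runs from the first opening tag to the last closing tag):
re.findall(r'(<term.+</term>)', textfile)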
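To make the problem easy to reproduce, here is a minimal, self-contained version of both attempts. The sample string `text` is made up for illustration (attributes left out to keep it short) and stands in for the contents of my `textfile`:

```python
import re

# Hypothetical sample standing in for my file contents (attributes omitted).
text = "a <term>wizard</term> and <term><term>dark</term> <term>arts</term></term> end"

# Non-greedy attempt: each match ends at the FIRST </term> it reaches,
# so the outer nested element is cut off after its first inner term.
lazy = re.findall(r'(<term.+?</term>)', text)

# Greedy attempt: a single match running from the first <term
# to the LAST </term> on the line, swallowing everything in between.
greedy = re.findall(r'(<term.+</term>)', text)

print(lazy)
print(greedy)
```

Neither result contains the complete nested element `<term><term>dark</term> <term>arts</term></term>` as a match of its own.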
Can you please help me?