Is there a way in regex to require that the same string occurs twice in a given structure (e.g. in XML parsing, where the closing tag must match the opening tag)? This code obviously does not work, as it matches from the first opening tag to the last closing tag:

re.findall(r'<(.+)>([\s\S]*)</(.+)>', s)

So is there a way to tell regex that the third match should be the same as the first?

Full code:

import re

s = '''<a1>
    <a2>
        1
    </a2>
    <b2>
        52
    </b2>
    <c2>
        <a3>
            Abc
        </a3>
    </c2>
</a1>
<b1>
    21
</b1>'''

matches = re.findall(r'<(.+)>([\s\S]*)</(.+)>', s)
for match in matches:
    print(match)

The result should be all the tags with their contents:

    [('a1', '\n    <a2>\n        1\n    </a2>\n    <b2>\n        52\n    </b2>\n    <c2>\n        <a3>\n            Abc\n        </a3>\n    </c2>\n'),
     ('a2', '\n        1\n    '),
      ...]

Note: I am not looking for a complete XML parsing package. The question is specifically about solving the given problem with regex.

mrCarnivore
  • I'd personally use a tag matcher that looks like this, so you don't overrun the match and slurp in tag boundaries with '.' or '*' wildcards: `<([^<>]+)>` – DDeMartini Dec 08 '17 at 15:25
  • @DDeMartini: Good point! I will take that advice as well. – mrCarnivore Dec 08 '17 at 15:26
  • [Don't parse XML or HTML with Regex](https://stackoverflow.com/a/1732454/1739000). Instead, see: [How do I parse XML in Python?](https://stackoverflow.com/questions/1912434/how-do-i-parse-xml-in-python) – NH. Dec 08 '17 at 15:52
  • @NH.: Not really. The question was specifically about regex; I was not looking for a complete XML parsing package. The answers here are also completely different from those in the post you mentioned. And the answers here are very valuable! – mrCarnivore Dec 08 '17 at 15:54
  • @NH.: Your warnings might be valid if I wanted to use the solution to parse a very complicated XML file with non-strict syntax or complicated, very specific cases. However, I was just looking to solve the problem of a very simple XML parser for simple and short XML texts. The answers in this post helped me greatly! – mrCarnivore Dec 08 '17 at 15:58
  • @mrCarnivore, the problem is, you think you want something simple, but the reality is: you will end up needing more information (the output you are getting is pretty ugly, I'm sure someone will want it cleaned up...), and things will get out of hand. – NH. Dec 08 '17 at 16:04
  • @NH.: Look at my own answer to this question, which I have now posted. That works quite nicely for my cases. In reality I will most likely not even run into nested XML tags, but it would work nonetheless. – mrCarnivore Dec 08 '17 at 16:06

3 Answers

You can use backreferences and simple recursion:

>>> def m(s):
...    matches = re.findall(r'<(.+)>([\s\S]*)</(\1)>', s)
...    for k,s2,_ in matches:
...        print((k, s2))
...        m(s2)
... 
>>> m(s)
('a1', '\n    <a2>\n  ...[dropped]...      </a3>\n    </c2>\n')
('a2', '\n        1\n    ')
('b2', '\n        52\n    ')
('c2', '\n        <a3>\n            Abc\n        </a3>\n    ')
('a3', '\n            Abc\n        ')
('b1', '\n    21\n')

More about backreferences from Microsoft Docs.
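
As a minimal sketch of how the backreference behaves (the sample strings and the group names `tag` and `body` are made up for illustration): `\1` only matches the exact text captured by group 1, and Python's `re` also accepts a named form, `(?P<tag>...)` together with `(?P=tag)`.

```python
import re

# Numbered backreference: \1 must repeat whatever group 1 captured.
numbered = re.compile(r'<([^<>]+)>([\s\S]*)</\1>')

# Named equivalent: (?P=tag) refers back to the group named "tag".
named = re.compile(r'<(?P<tag>[^<>]+)>(?P<body>[\s\S]*)</(?P=tag)>')

ok = '<a>text</a>'   # closing tag repeats the opening tag name
bad = '<a>text</b>'  # closing tag differs, so the pattern cannot match

print(numbered.findall(ok))   # [('a', 'text')]
print(numbered.findall(bad))  # []

m = named.search(ok)
print(m.group('tag'), m.group('body'))  # a text
```

Since the backreference is not wrapped in its own group here, `findall` returns 2-tuples rather than the 3-tuples produced by `</(\1)>`.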

Edited

For extra fun, here it is as a generator. Thanks @mrCarnivore for your suggestion to remove `if s:`:

>>> def m(s):
...    matches = re.findall(r'<(.+)>([\s\S]*)</(\1)>', s)
...    for k,s2,_ in matches:
...        yield (k,s2)
...        yield from m(s2)
... 
>>> for x in m(s):
...    x
... 
('a1', '\n    <a2>\ [....]     Abc\n        </a3>\n    </c2>\n')
('a2', '\n        1\n    ')
('b2', '\n        52\n    ')
('c2', '\n        <a3>\n            Abc\n        </a3>\n    ')
('a3', '\n            Abc\n        ')
('b1', '\n    21\n')
dani herrera
  • Thanks, that was easier than I had hoped for! However, why do you check the value of `s` in the for loop? Did you mean to check `s2`? – mrCarnivore Dec 08 '17 at 15:23
  • The backreference is the proper way to do this. The `(\1)` is what's known as a backreference, and it matches the text captured by the first parenthesized group. You need this because you want to match each opening tag only with its own closing tag to get the proper content. In short, you're "referring back" to a previous match, which is what you need here. You also need the recursion, because matching tends to be "hungry." When you match the first 'a1' tag, it consumes the entire element, nested tags included. So, without recursion, you'll only find 'a1' and 'b1'. – GaryMBloom Dec 08 '17 at 15:24
  • @Gary02127: That was not my question. I have understood the concept of backreference. What I did not get is why there is a line `if s:` in the function. At that position it does not make any sense to me... – mrCarnivore Dec 08 '17 at 15:26
  • Our posts overlapped. – GaryMBloom Dec 08 '17 at 15:28
  • @Gary02127: Oh, ok. So you just wanted to stress that the answer is the best way to solve this? – mrCarnivore Dec 08 '17 at 15:30
  • Yup. And to add a bit of explanation as to why. – GaryMBloom Dec 08 '17 at 15:32
  • @Gary02127: Thanks for the explanation! @danihp: Thanks a lot for the `yield from`. I have not had much experience with generators, but I am very interested in learning to use them properly. So thanks for the opportunity! – mrCarnivore Dec 08 '17 at 15:34

I wouldn't do this, because recursive structures are difficult to parse with regexes. Python's re module doesn't support recursive patterns; the alternative regex module does. Even so, I wouldn't do it.

A backreference can only bring you this far:

import re

s = '''<a1>
    <a2>
        1
    </a2>
    <b2>
        52
    </b2>
    <c2>
        <a3>
            Abc
        </a3>
    </c2>
</a1>
<b1>
    21
</b1>'''

matches = re.findall(r'<(.+)>([\s\S]*)</\1>', s) # mind the \1
for match in matches:
    print(match)

It will give you only two matches: <a1> ... </a1> and <b1> ... </b1>.

Now imagine that some of your tags have attributes. What if a tag spans more than one line? What about tags that close themselves? What about accidental spaces?

A html / xml parser can deal with all of this.
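
For comparison, here is a rough sketch of the parser route using the standard library's `xml.etree.ElementTree`. The compact sample string, the `id` attribute, and the wrapping `<root>` element are assumptions added for illustration: the original sample has two top-level elements, which is not well-formed XML on its own.

```python
import xml.etree.ElementTree as ET

# Compact variant of the sample, with one attribute added to show that
# the parser handles it without any change to the code.
s = '<a1><a2>1</a2><c2><a3 id="x">Abc</a3></c2></a1><b1>21</b1>'

# Wrap in a hypothetical <root> so there is a single top-level element.
root = ET.fromstring('<root>' + s + '</root>')

# iter() walks the element and all descendants in document order.
pairs = [(el.tag, (el.text or '').strip())
         for el in root.iter() if el is not root]

print(pairs)
# [('a1', ''), ('a2', '1'), ('c2', ''), ('a3', 'Abc'), ('b1', '21')]
```

Unlike the regex approach, this yields only each element's own text rather than the raw inner markup.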

Tamas Rev
  • Thanks for the caveat. However, I wanted just a lightweight xml parser to parse simple xmls without self closing tags. Accidental spaces should not be there either. – mrCarnivore Dec 08 '17 at 15:33

Using the help danihp gave me in the answer and following the hint DDeMartini gave in the comment, I was able to create a lightweight XML parser that returns a dict structure of the XML:

import re

def xml_loads(xml_text):
    matches = re.findall(r'<([^<>]+)>([\s\S]*)</(\1)>', xml_text)
    if not matches:
        return xml_text.strip()
    d = {}
    for k, s2, _ in matches:
        d[k] = xml_loads(s2)
    return d


s = '''<a1>
    <a2>
        1
    </a2>
    <b2>
        52
    </b2>
    <c2>
        <a3>
            Abc
        </a3>
    </c2>
</a1>
<b1>
    21
</b1>'''

d = xml_loads(s)
print(d)
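
For the sample above, this prints `{'a1': {'a2': '1', 'b2': '52', 'c2': {'a3': 'Abc'}}, 'b1': '21'}`.

One caveat, which is my reading of the pattern rather than something raised in the thread: the greedy `[\s\S]*` runs to the last closing tag with the same name, so repeated sibling tags are merged into one match (and a dict could hold only one of them anyway). A lazy `[\s\S]*?` separates siblings but then breaks on nested tags of the same name. A small sketch with made-up inputs:

```python
import re

greedy = r'<([^<>]+)>([\s\S]*)</\1>'
lazy = r'<([^<>]+)>([\s\S]*?)</\1>'

siblings = '<a>1</a><a>2</a>'  # same tag name twice at one level
nested = '<a><a>1</a></a>'     # same tag name nested in itself

# Greedy merges the two siblings into a single match.
print(re.findall(greedy, siblings))  # [('a', '1</a><a>2')]
# Lazy keeps the siblings apart.
print(re.findall(lazy, siblings))    # [('a', '1'), ('a', '2')]

# Greedy handles same-name nesting correctly.
print(re.findall(greedy, nested))    # [('a', '<a>1</a>')]
# Lazy stops at the first closing tag and truncates the content.
print(re.findall(lazy, nested))      # [('a', '<a>1')]
```

Neither quantifier handles both cases, which is the kind of input where a real parser becomes the safer choice.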
mrCarnivore