Regex named groups and conditional logic

Question

Consider the following string (edit: this is not a question about parsing HTML with regexes; it is just an exercise with named groups):

s = """<T1>
        <A1>
        lorem ipsum
        </A1>
      </T1>"""

Is it possible to use re.sub and named groups to transform the string into this result?

<T1>
  <test number="1">
  lorem ipsum
  </test>
</T1>

Right now I have the following code:

import re
regex = re.compile(r"(<(?P<end>\/*)A(?P<number>\d+)>)")
print(regex.sub(r'<\g<end>test number="\g<number>">', s))

which gives the following result

<T1>
  <test number="1">
  lorem ipsum
  </test number="1">
</T1>

Can an | operator be used like in this question?

Never use regex for parsing HTML or XML; instead use a proper module made for this task, like `lxml` or ... – Mazdak, Feb 04 '15 at 09:23
I understand (normally I would always use `lxml`). This is just an exercise for my understanding of `re`. – Jeff, Feb 04 '15 at 09:24

vks · Answer 1 · 2015-02-04T09:40:57.613

1

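import re

x = """<T1>
    <A1>
    lorem ipsum
    </A1>
  </T1>"""

def repl(obj):
    # Group 1 captures the optional slash, so it is non-empty only for a closing tag.
    if obj.group(1):
        return '/test'
    else:
        # Group 2 holds the digits that follow "A".
        return 'test number="' + obj.group(2) + '"'

print(re.sub(r"(\/*)A(\d+)", repl, x))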

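You can try passing a replacement function to re.sub. It is called with each match object and returns the replacement text, so it can branch on which groups actually matched.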

edited Feb 04 '15 at 09:40

answered Feb 04 '15 at 09:26

vks


@Jeff you're welcome. You can use this technique for more complex substitutions that depend on what was captured. – vks Feb 04 '15 at 10:07

Avinash Raj · Accepted Answer · 2015-02-04T09:47:19.523

1

Try to match the whole element: not only the opening and closing tags, but also its contents.

Regex:

(<(?P<end>\/*)(A)(?P<number>\d+)>)(.*?)</\3\4>

Replacement string:

<test number="\g<number>">\5</test>

DEMO

>>> s = """<T1>
        <A1>
        lorem ipsum
        </A1>
      </T1>"""
>>> import re
>>> print(re.sub(r'(?s)(<(?P<end>\/*)(A)(?P<number>\d+)>)(.*?)</\3\4>', r'<test number="\g<number>">\5</test>', s))
<T1>
        <test number="1">
        lorem ipsum
        </test>
      </T1>

(?s) is the inline form of the DOTALL modifier, which makes the dot in the regex match newline characters as well.
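
As a rough sketch of a variant that relies only on named groups, the closing tag can be matched with a named backreference, `(?P=number)`, instead of `\3\4`. The group name `body` and the variable names are assumptions made for this example. It also shows what happens when DOTALL is left out:

```python
import re

s = """<T1>
        <A1>
        lorem ipsum
        </A1>
      </T1>"""

# (?P=number) is a named backreference: the closing tag must repeat the same digits.
pattern = r"<A(?P<number>\d+)>(?P<body>.*?)</A(?P=number)>"
replacement = r'<test number="\g<number>">\g<body></test>'

# Without DOTALL the dot cannot cross the newlines, so nothing matches
# and the string comes back unchanged.
without_dotall = re.sub(pattern, replacement, s)

# With DOTALL the lazy .*? can span the lines between the two tags.
with_dotall = re.sub(pattern, replacement, s, flags=re.DOTALL)

print(without_dotall == s)
print(with_dotall)
```')
assert with_dotall == expected
</test>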

edited Feb 04 '15 at 09:47

answered Feb 04 '15 at 09:29

Avinash Raj


That makes a lot of sense. I didn't think about matching against everything and then just including `\5` in the replacement string. Thanks! – Jeff Feb 04 '15 at 09:41

score 1 · Answer 3 · answered Feb 04 '15 at 09:37

You can use look-arounds to match the text between <T1> and </T1>:

>>> p = re.compile(r'(?<=<T1>)[^<]+?(.+)(?=</T1>)', re.MULTILINE | re.IGNORECASE | re.DOTALL)
>>> s2 = '\n  <test number="1">\n  lorem ipsum\n  </test>\n'
>>> print(p.sub(s2, s))
<T1>
  <test number="1">
  lorem ipsum
  </test>
</T1>

The pattern is compiled with the following flags:

re.IGNORECASE Perform case-insensitive matching; expressions like [A-Z] will match lowercase letters, too. This is not affected by the current locale.

re.MULTILINE When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string.

re.DOTALL Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
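
To see each flag in isolation, here is a small sketch on a made-up two-line string. The sample text and variable names are assumptions for illustration only:

```python
import re

text = "first line\nSecond line"

# MULTILINE: '^' also matches right after each newline.
starts_default = re.findall(r"^\w+", text)
starts_multiline = re.findall(r"^\w+", text, flags=re.MULTILINE)

# IGNORECASE: a lowercase pattern also matches the capitalised word.
exact_case = re.findall(r"second", text)
any_case = re.findall(r"second", text, flags=re.IGNORECASE)

# DOTALL: '.' is allowed to match the newline between the two lines.
dot_default = re.findall(r"line.Second", text)
dot_all = re.findall(r"line.Second", text, flags=re.DOTALL)

print(starts_default, starts_multiline)
print(exact_case, any_case)
print(dot_default, dot_all)
```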

@Jeff you're welcome. Note that the flags described above can be very helpful in such cases. – Mazdak, Feb 04 '15 at 09:43
