1

Consider the following string (edit: this is not a parsing HTML with regexs questions. Rather just an exercise with named groups):

s = """<T1>
        <A1>
        lorem ipsum
        </A1>
      </T1>"""

Is it possible to use re.sub and named groups to transform the string into this result?

<T1>
  <test number="1">
  lorem ipsum
  </test>
</T1>

Right now I have the following code:

import re
regex = re.compile("(<(?P<end>\/*)A(?P<number>\d+)>)")
print regex.sub('<\g<end>test number="\g<number>">', s)

which gives the following result

<T1>
  <test number="1">
  lorem ipsum
  </test number="1">
</T1>

Can an | operator be used like in this question?

Community
  • 1
  • 1
Jeff
  • 227
  • 1
  • 10
  • 1
    never use regex for parsing html or xml , instead use proper modules that is for this tasl like `lxml` or ... – Mazdak Feb 04 '15 at 09:23
  • 1
    I understand (normally I would always use `lmxl`). This is just an exercise for my understanding of `re`. – Jeff Feb 04 '15 at 09:24

3 Answers3

1
x="""<T1>
    <A1>
    lorem ipsum
    </A1>
  </T1>"""

def repl(obj):

    if obj.group(1):
        return '/test'
    else:
        return 'test number="'+obj.group(2)+'"'

print re.sub(r"(\/*)A(\d+)",repl,x)

You can tyr the replacement function provided by re.sub.

vks
  • 67,027
  • 10
  • 91
  • 124
  • @Jeff you re welcome.You can use this technique to attain difficult substitutes based on captures. – vks Feb 04 '15 at 10:07
1

Try to match the whole tag. Not only the opening and closing tags but catch also it's contents.

REgex:

(<(?P<end>\/*)(A)(?P<number>\d+)>)(.*?)</\3\4>

REplacement string:

<test number="\g<number>">\5</test>

DEMO

>>> s = """<T1>
        <A1>
        lorem ipsum
        </A1>
      </T1>"""
>>> import re
>>> print(re.sub(r'(?s)(<(?P<end>\/*)(A)(?P<number>\d+)>)(.*?)</\3\4>', r'<test number="\g<number>">\5</test>', s))
<T1>
        <test number="1">
        lorem ipsum
        </test>
      </T1>

(?s) called DOTALL modifier which matches makes dot in your regex to match even newline characters also.

Avinash Raj
  • 172,303
  • 28
  • 230
  • 274
  • That makes a lot of sense. I didn't think about matching against everything and then just including `` in the replacement string. Thanks! – Jeff Feb 04 '15 at 09:41
1

You can use look-around to match string between <T1> and </T1> :

>>> p = re.compile(ur'(?<=<T1>)[^<]+?(.+)(?=</T1>)', re.MULTILINE | re.IGNORECASE | re.DOTALL)
>>> s2='\n  <test number="1">\n  lorem ipsum\n  </test>\n'
>>> print p.sub(s2,s,re.MULTILINE)
<T1>
  <test number="1">
  lorem ipsum
  </test>
</T1>

you need to use following Contents :

re.IGNORECASE Perform case-insensitive matching; expressions like [A-Z] will match lowercase letters, too. This is not affected by the current locale.

re.MULTILINE When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string.

re.DOTALL Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.

Mazdak
  • 105,000
  • 18
  • 159
  • 188
  • @Jeff you're welcome , note that using the following contents could be very helpful in such cases – Mazdak Feb 04 '15 at 09:43