Fortunately for you, re.sub accepts a function as the replacement argument as well. The function receives a match object; from there, you can get the matched groups with match.group(1), match.group(2), and so on. The return value of the function is the string that replaces the whole match in the input text.
import re
def fn(match):
    # match.group(1) is the decimal code point captured by the regex
    return unichr(int(match.group(1)))
re.sub('&#([^;]*);', fn, inputtext, flags=re.UNICODE)
If you really want, you can inline this and use a lambda, but I think a lambda makes it harder to read in this case [1].
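For readers on Python 3, where unichr no longer exists, the same callback idea can be sketched roughly as follows. The names replace_numeric_entity, NUMERIC_ENTITY and decode_numeric_entities are made up for this illustration, and the pattern is tightened from `[^;]*` to `\d+` (my substitution, not part of the original answer) so that int() only ever sees decimal digits.

```python
import re

# Assumption: only decimal references such as &#233; need handling here.
NUMERIC_ENTITY = re.compile(r"&#(\d+);")

def replace_numeric_entity(match):
    """Return the character for the decimal code point in group 1."""
    return chr(int(match.group(1)))

def decode_numeric_entities(text):
    # re.sub calls replace_numeric_entity once per match and splices
    # its return value in place of the whole matched reference.
    return NUMERIC_ENTITY.sub(replace_numeric_entity, text)

print(decode_numeric_entities("sin&#243; els del m&#243;n"))  # sinó els del món
```

Named sequences such as `&amp;` are left untouched by this sketch, which is one reason the library helpers mentioned in this answer are usually the better choice.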
By the way, depending on your Python version, there are better ways to unescape HTML, since they also handle the named escape sequences like '&amp;':
Python 2.x:
>>> import HTMLParser
>>> s = 'Ell &#233;s la v&#237;ctima que expia els nostres pecats, i no tan sols els nostres, sin&#243; els del m&#243;n sencer.'
>>> print HTMLParser.HTMLParser().unescape(s)
Ell és la víctima que expia els nostres pecats, i no tan sols els nostres, sinó els del món sencer.
Python 3.x:
>>> import html
>>> html.unescape(s)
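
As a small illustration of why the library route is more robust than a hand-written regex, here is a sketch for Python 3.4 or later. The sample string is invented for this example; it mixes a decimal reference, a named entity, `&amp;` and a hexadecimal reference, all of which html.unescape decodes in one call.

```python
import html

# Hypothetical sample mixing decimal, named and hexadecimal references.
s = "Ell &#233;s la v&iacute;ctima &amp; m&#xF3;n"

print(html.unescape(s))  # Ell és la víctima & món
```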
[1] Especially if you give fn a more sensible name ;-)