python re.sub with variable

Question

Input text:

Ell &#233;s la v&#237;ctima que expia els nostres pecats, i no tan sols els nostres, sin&#243; els del m&#243;n sencer.

Expected output:

Ell és la víctima que expia els nostres pecats, i no tan sols els nostres, sinó els del món sencer.

Known facts: unichr(233)=é

For now I have:

re.sub('&#([^;]*);', r'unichr(int(\1))', inputtext, flags=re.UNICODE)

and of course it is not working. I don't know how to apply a function to \1.

Any idea?

score 5 · Answer 1 · answered Jan 13 '15 at 00:25


Use a lambda function:

re.sub('&#([^;]*);', lambda match: unichr(int(match.group(1))), t, flags=re.UNICODE)

answered Jan 13 '15 at 00:25

Aran-Fey


This was very speedy @rawing, let me check – josifoski Jan 13 '15 at 00:26

score 4 · Accepted Answer · edited May 23 '17 at 11:58

Fortunately for you, re.sub also accepts a function as the replacement argument. The function will receive a match object; from there, you can get the matched groups with match.group(1), match.group(2), etc. The return value of the function is the string that replaces the entire match in the input text.

def fn(match):
  return unichr(int(match.group(1)))

re.sub('&#([^;]*);', fn, inputtext, flags=re.UNICODE)
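On Python 3, unichr no longer exists and chr covers the same job, so the function-based replacement might look roughly like this. The name decode_numeric_entity and the stricter \d+ pattern are choices made here for illustration; they are not part of the original answer.

```python
import re

def decode_numeric_entity(match):
    # match.group(1) holds the decimal digits captured between "&#" and ";"
    return chr(int(match.group(1)))

inputtext = "Ell &#233;s la v&#237;ctima"
# \d+ (an assumption of this sketch) only accepts decimal digits,
# so int() cannot fail on a stray match such as "&#x;"
result = re.sub(r"&#(\d+);", decode_numeric_entity, inputtext)
print(result)
```

Restricting the group to digits is a small safety measure: with [^;]* the callback would raise ValueError on anything non-numeric between "&#" and ";".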

If you really want, you can inline this as a lambda, but I think a lambda makes it harder to read in this case¹.

By the way, depending on your Python version, there are better ways to unescape HTML, and they will also handle named escape sequences like '&amp;':

Python 2.x

>>> import HTMLParser
>>> s = 'Ell &#233;s la v&#237;ctima que expia els nostres pecats, i no tan sols els nostres, sin&#243; els del m&#243;n sencer.'
>>> print HTMLParser.HTMLParser().unescape(s)
Ell és la víctima que expia els nostres pecats, i no tan sols els nostres, sinó els del món sencer.

Python 3.x

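>>> import html
>>> s = 'Ell &#233;s la v&#237;ctima que expia els nostres pecats, i no tan sols els nostres, sin&#243; els del m&#243;n sencer.'
>>> html.unescape(s)
'Ell és la víctima que expia els nostres pecats, i no tan sols els nostres, sinó els del món sencer.'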
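To illustrate the point about named escape sequences, here is a small Python 3 sketch that mixes numeric and named entities in one string. The sample string is made up for this example.

```python
import html

# A made-up sample mixing numeric (&#243;) and named (&amp;, &lt;, &gt;) entities
s = "sin&#243; &amp; m&#243;n &lt;sencer&gt;"
unescaped = html.unescape(s)
print(unescaped)
```

A regex that only targets &#NNN; would leave &amp;, &lt; and &gt; untouched, which is the main reason to prefer the library function here.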


¹ Especially if you give fn a more sensible name ;-)

@josifoski I realized that it looks like you're formatting HTML strings. If so, there's a better way that doesn't involve regex on your part :-). See my update. – mgilson Jan 13 '15 at 00:40
@mgilson Thanks, that's a much better way. Yes, I want to make 'readable' HTML text. – josifoski Jan 13 '15 at 00:42
