How to change original match in re.sub

Question

I want to split text in my HTML using <br> tags. If the text is longer than 50 characters, I want to replace the last space before 10 characters with <br>.

The text is in <span class="value">TEXT</span>

For example, <span class="value">cccc cc cccccc cccc cc c</span>

will become <span class="value">cccc cc<br>cccccc<br>cccc cc c</span>, so that every line has at most 10 characters.

I've created a regex that can probably find such tags, but I can't figure out how to extract the text from the matched group and then replace it.

snippet = re.sub(r'<span class="value">(.*)<\/span>', 
                 r'<span class="value">\1<\/span>'.(divide text using <br> tags)

Do you know how to do that?

Nooo... do **not** parse, process, generate XML/HTML with regular expressions. Use XPath, XSLT, BeautifulSoup,... — Willem Van Onsem, Apr 08 '17 at 12:10

kennytm · Answer 1 · 2017-04-08T12:21:58.353

The replacement argument of re.sub can be a function which takes a "match object" and returns the replacement string. With this you can apply any transformation to the matched text.

import re

def replace_text(m):
    # divide_text is your own function that inserts the <br> tags
    return '<span class="value">' + divide_text(m.group(1)) + '</span>'

snippet = re.sub(r'<span class="value">(.*?)</span>', replace_text, snippet)
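A minimal sketch of what divide_text could look like, assuming the goal is lines of at most 10 characters broken at spaces. The `width` parameter and the use of the standard library's textwrap module are assumptions of this sketch, not part of the question:

```python
import re
import textwrap

def divide_text(text, width=10):
    # textwrap.wrap breaks at whitespace so that no line exceeds `width`;
    # by default a single word longer than `width` is split mid-word.
    return "<br>".join(textwrap.wrap(text, width=width))

def replace_text(m):
    # m.group(1) is the text captured between the span tags.
    return '<span class="value">' + divide_text(m.group(1)) + '</span>'

snippet = '<span class="value">cccc cc cccccc cccc cc c</span>'
result = re.sub(r'<span class="value">(.*?)</span>', replace_text, snippet)
print(result)
# <span class="value">cccc cc<br>cccccc<br>cccc cc c</span>
```

Note that textwrap.wrap also collapses runs of whitespace, which is usually harmless for HTML text but does change the input slightly.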

Note that using an HTML parsing library gives much better control when the input does not contain exactly the string <span class="value">, e.g.

import lxml.html

document = lxml.html.fromstring('''<html><body>
<span class="value">aaa</span>
<span class=value>bbb</span>
<span class="value-is-irrelevant">ccc</span>
<span class="value should-match-this-too">ddd</span>
</body></html>''')

# http://stackoverflow.com/q/1604471/
elements = document.xpath("//span[contains(concat(' ', @class, ' '), ' value ')]")
for element in elements:
    element.text = element.text.upper()
    # do your "divide text" here.

# encoding='unicode' returns a str; the default returns bytes on Python 3
print(lxml.html.tostring(document, encoding='unicode'))
# <html><body>
# <span class="value">AAA</span>
# <span class="value">BBB</span>
# <span class="value-is-irrelevant">ccc</span>
# <span class="value should-match-this-too">DDD</span>
# </body></html>

score 0 · Answer 2 · answered Apr 08 '17 at 12:16

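This will divide the span's text every 10 characters, regardless of where the spaces are. Note that the replacement is only the joined text, so the surrounding span tags are not kept.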

import re
html = '<span class="value">cccc cc cccccc cccc cc c</span>'
snippet = re.sub(r'<span class="value">(.*?)</span>',
                 lambda m: "<br>".join(m.group(1)[i:i+10] for i in range(0, len(m.group(1)), 10)),
                 html)
print(snippet)
