Using regex with HTML is fraught with issues, so you should be aware of the possible consequences. Your <code>.+?</code> regex only works when the <code> and </code> tags are on the same line and no other <code> tags are nested inside them.
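To see why nesting is a problem, here is a minimal sketch using a made-up input string (the nested sample is an assumption for illustration, not from the question). The lazy pattern stops at the first closing tag it finds, so the outer tag is cut short and a stray closing tag is left behind:

```python
import re

# Hypothetical input: one code tag nested inside another.
nested = "<code>outer <code>inner</code> tail</code>"

# The lazy .+? ends at the FIRST </code>, not at the matching outer one.
result = re.sub(
    "<code>(.+?)</code>",
    lambda m: " " if " " in m.group(1) else m.group(),
    nested,
)

print(repr(result))  # '  tail</code>'
```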
Assuming there are no nested code tags, you might extend your current approach:
import re
inputString = "I want to remove <code>tag with space</code> not sole <code>word</code>"
outputString = re.sub("<code>(.+?)</code>", lambda m: " " if " " in m.group(1) else m.group(), inputString, flags=re.S)
print(outputString)
The re.S flag lets . match line breaks, and the lambda checks each match: any code tag whose content contains a space is replaced with a single space; otherwise the tag is kept as is.
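The effect of re.S is easiest to see on a tag whose content spans two lines. The sketch below uses a made-up sample string and a hypothetical helper name, blank_if_spaced, and it checks for any whitespace with \s rather than only a literal space, which is a small variation on the code above:

```python
import re

# Hypothetical sample: the first tag's content contains a line break.
text = "Remove <code>tag with\nline break</code> but keep <code>word</code>"

def blank_if_spaced(match):
    # Swap the whole tag for one space if its content has any whitespace.
    return " " if re.search(r"\s", match.group(1)) else match.group()

pattern = "<code>(.+?)</code>"

with_flag = re.sub(pattern, blank_if_spaced, text, flags=re.S)
without_flag = re.sub(pattern, blank_if_spaced, text)

print(repr(with_flag))     # 'Remove   but keep <code>word</code>'
print(without_flag == text)  # True: the multi-line tag never matches
```

Without the flag, . cannot cross the line break, so the first tag is never matched and the string comes back unchanged.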
A more common way to parse HTML in Python is to use BeautifulSoup. First parse the HTML, then get all the code tags, and replace each code tag whose text contains a space:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('I want to remove <code>tag with space</code> not sole <code>word</code>', "html.parser")
>>> for p in soup.find_all('code'):
...     if p.string and " " in p.string:
...         p.replace_with(" ")
...
>>> print(soup)
I want to remove   not sole <code>word</code>
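One caveat with the loop above: .string is None when a tag has more than one child, so a code tag that contains other markup is skipped. A possible variation, sketched here with a hypothetical function name and a made-up input, is to test get_text() instead, which joins the text of all descendants:

```python
from bs4 import BeautifulSoup

def blank_spaced_code_tags(html):
    """Replace each code element whose text contains a space with one space."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all("code"):
        # get_text() also covers tags with child elements,
        # where .string would be None.
        if " " in tag.get_text():
            tag.replace_with(" ")
    return str(soup)

# Hypothetical input: the first code tag holds nested markup.
sample = "Remove <code>two <b>words</b></code> but keep <code>word</code>"
print(repr(blank_spaced_code_tags(sample)))
# 'Remove   but keep <code>word</code>'
```

Whether nested markup should count toward the space check depends on your data, so treat this as one option rather than a drop-in replacement.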