How to use regex to remove string within certain HTML tag and string must contain empty space

Question

I am trying to clean some HTML data with a regular expression in Python. Given an input string with HTML tags, I want to remove a tag and its content if the content contains a space. The requirement looks like this:

inputString = "I want to remove <code>tag with space</code> not sole <code>word</code>"
outputString = regexProcess(inputString)
print(outputString)

>>I want to remove not sole <code>word</code>

The regex re.sub("<code>.+?</code>", " ", inputString) removes every code tag regardless of its content. How can I improve it, or are there other methods?

Thanks in advance.

more constructive: https://www.crummy.com/software/BeautifulSoup/ — hiro protagonist, Jan 03 '17 at 09:34

score 4 · Accepted Answer · answered Jan 03 '17 at 09:56

Using regex with HTML is fraught with various issues, so you should be aware of the possible consequences. Your <code>.+?</code> regex will only work if the <code> and </code> tags are on the same line and there are no nested <code> tags inside them.
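Both limitations can be reproduced with a short sketch. The two sample strings (`multiline` and `nested`) are made up for illustration and are not from the question:

```python
import re

pattern = "<code>.+?</code>"

# Made-up input where the element spans two lines: "." does not match "\n"
# by default, so the pattern finds nothing and the string is unchanged.
multiline = "keep <code>first line\nsecond line</code> keep"
print(re.sub(pattern, " ", multiline) == multiline)  # True

# Made-up nested input: the lazy match stops at the first </code>,
# so a stray closing tag is left behind.
nested = "<code>a <code>b</code> c</code>"
print(repr(re.sub(pattern, " ", nested)))  # '  c</code>'
```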

Assuming there are no nested code tags, you might extend your current approach:

import re
inputString = "I want to remove <code>tag with space</code> not sole <code>word</code>"
outputString = re.sub("<code>(.+?)</code>", lambda m: " " if " " in m.group(1) else m.group(), inputString, flags=re.S)
print(outputString)

The re.S flag enables . to match line breaks, and the lambda checks each match: any code tag whose content contains a space character is replaced with a single space; otherwise it is kept unchanged. Note that the check is for a literal " ", so content containing only tabs or line breaks is not removed.
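The same idea can be wrapped in a reusable function. This is only a sketch: the names `CODE_RE` and `remove_spaced_code` are hypothetical, and the two-line input is made up to show the effect of re.S:

```python
import re

# Precompiled pattern; re.S lets "." match line breaks inside the element.
CODE_RE = re.compile(r"<code>(.+?)</code>", flags=re.S)

def remove_spaced_code(html):
    """Replace each <code>...</code> whose content contains a space with one space."""
    return CODE_RE.sub(lambda m: " " if " " in m.group(1) else m.group(), html)

# The question's input: only the element containing spaces is replaced.
print(remove_spaced_code(
    "I want to remove <code>tag with space</code> not sole <code>word</code>"
))

# Made-up two-line input: the element is still matched thanks to re.S.
print(repr(remove_spaced_code("a <code>two\nlines here</code> b")))
```

As with the original one-liner, this assumes the code tags are not nested.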


A more common way to parse HTML in Python is to use BeautifulSoup. First, parse the HTML, then get all the code tags, and then replace each code tag whose node contains a space:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('I want to remove <code>tag with space</code> not sole <code>word</code>', "html.parser")
>>> for p in soup.find_all('code'):
...     if p.string and " " in p.string:
...         p.replace_with(" ")
...
>>> print(soup)
I want to remove   not sole <code>word</code>

This is a more complete answer, I learned from it. Thanks a lot. — ccy, Jan 03 '17 at 11:17

Jean-François Fabre · Answer 2 · 2017-01-03T09:57:35.557

1

It is a bad idea to parse HTML with regex. However, if your HTML is simple enough, you could do this:

re.sub(r"<code>[^<]*\s[^<]*</code>", " ", inputString)

We're looking for at least one whitespace character somewhere inside the tag. To make it work with several code tags on the same line, I've excluded the < character from the content (it cannot appear literally in the text, since an escaped one is written &lt;).

OK, it's still a hack; a proper HTML parser is preferred.

Small test:

inputString = "<code>hello </code>  <code>world</code> <code>hello world</code> <code>helloworld</code>"

I get:

   <code>world</code>   <code>helloworld</code>
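The test can be run end to end as below. This is a sketch: the second input, `mixed`, is a made-up string modelled on the situation described in the comments on this answer (a first tag without whitespace followed by one with whitespace):

```python
import re

pattern = r"<code>[^<]*\s[^<]*</code>"

test = "<code>hello </code>  <code>world</code> <code>hello world</code> <code>helloworld</code>"
result = re.sub(pattern, " ", test)
print(repr(result))

# Made-up input: [^<]* cannot run past the first closing tag,
# so the first element is kept and only the second is replaced.
mixed = "<code>word</code> and <code>two words</code>"
mixed_result = re.sub(pattern, " ", mixed)
print(repr(mixed_result))
```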

edited Jan 03 '17 at 09:57

answered Jan 03 '17 at 09:34

Jean-François Fabre


3

This is a common error when using regex to parse HTML. It won't work in case there are several code tags on one line, if the first has no whitespace inside and the second has it. – Wiktor Stribiżew Jan 03 '17 at 09:35
Well, at least the user is warned. – Jean-François Fabre Jan 03 '17 at 09:36
Please fix or remove this answer, it is wrong. I mean you can keep the warning, but the regex is wrong and misleading. People think it is cool, while it is not. – Wiktor Stribiżew Jan 03 '17 at 09:42
`hello world hello world helloworld` – Vasili Syrakis Jan 03 '17 at 09:43
@WiktorStribiżew thanks for the reminder, I will pay attention – ccy Jan 03 '17 at 09:44
@WiktorStribiżew I understand. I've edited so there's no problem with multiple tags on the same line. – Jean-François Fabre Jan 03 '17 at 09:52
And I provided a way to use BeautifulSoup for this task and an alternative OP code "extension". – Wiktor Stribiżew Jan 03 '17 at 09:58

score 0 · Answer 3 · answered Jan 03 '17 at 09:52

You can also remove tags by matching their opening and closing angle brackets.

inputString = re.sub(r"<.*?>", " ", inputString)

In my case it works. Enjoy.
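A quick check on the question's input suggests that this pattern replaces every tag with a space but leaves the text between the tags in place, which differs from what the question asks for. The variable name `stripped` is chosen here only for illustration:

```python
import re

inputString = "I want to remove <code>tag with space</code> not sole <code>word</code>"

# Every "<...>" tag becomes a single space; the enclosed text is untouched.
stripped = re.sub(r"<.*?>", " ", inputString)
print(repr(stripped))
```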

answered Jan 03 '17 at 09:52

rofelia09

