How to return everything in a string that is not matched by a regex?

Question

I have a string and a regular expression that matches portions of the string. I want to return a string representing what's left of the original string after all matches have been removed.

import re

string = '<font size="2px" face="Tahoma"><br>Good Morning,&nbsp;</font><div><br></div><div>As per last email'

pattern = r'<[a-zA-Z0-9 ="/\-:;.]*>'

re.findall(pattern, string)

['<font size="2px" face="Tahoma">',
 '<br>',
 '</font>',
 '<div>',
 '<br>',
 '</div>',
 '<div>']

desired_string = "Good Morning,&nbsp;As per last email"

score 3 · Answer 1 · edited May 23 '17 at 12:24

Instead of re.findall, use re.sub to replace each match with an empty string.

re.sub(pattern, "", string)
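As a minimal, self-contained sketch of that call, here it is applied to the sample string from the question. The sample is written with single quotes on the outside so that the double quotes inside the HTML do not end the literal early. The name `remaining` is only an illustrative choice.

```python
import re

# Sample input from the question; single quotes outside so the inner
# double quotes are part of the string.
string = '<font size="2px" face="Tahoma"><br>Good Morning,&nbsp;</font><div><br></div><div>As per last email'

# Pattern from the question: "<", any run of the listed characters, then ">".
pattern = r'<[a-zA-Z0-9 ="/\-:;.]*>'

# Replace every match with the empty string; whatever is left is the
# unmatched part of the input.
remaining = re.sub(pattern, "", string)
print(remaining)  # Good Morning,&nbsp;As per last email
```

Note that the `&nbsp;` entity is left untouched, because the pattern only matches tags.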

While that's the literal answer to your general question about removing patterns from a string, it appears that your specific problem is related to manipulating HTML. It's generally a bad idea to try to manipulate HTML with regular expressions. For more information see this answer to a similar question: https://stackoverflow.com/a/1732454/7432
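One concrete way the regex approach can go wrong, sketched with a made-up tag (the `href` value below is hypothetical, not taken from the question): the character class in the pattern does not include `?`, so a tag containing that character is never matched and is left in the output.

```python
import re

# Same pattern as in the question.
pattern = r'<[a-zA-Z0-9 ="/\-:;.]*>'

# Hypothetical input: the "?" in the attribute value is not in the
# pattern's character class, so the opening tag cannot match.
tricky = '<a href="page?id=1">link</a>'

leftover = re.sub(pattern, "", tricky)
print(leftover)  # <a href="page?id=1">link
```

Only the closing `</a>` is removed here. Widening the character class fixes this one case, but each fix of that kind tends to expose another, which is the usual argument for using a parser.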

score 1 · Accepted Answer · answered Apr 13 '16 at 17:08

Instead of a regular expression, use an HTML parser like BeautifulSoup. It looks like you are trying to strip the HTML elements and get the underlying text.

from bs4 import BeautifulSoup

string="""<font size="2px" face="Tahoma"><br>Good Morning,&nbsp;</font><div><br></div><div>As per last email"""

soup = BeautifulSoup(string, 'lxml')

print(soup.get_text())

This outputs:

Good Morning, As per last email

One thing to notice is that the &nbsp; entity was decoded to a space character with this method (a non-breaking space, U+00A0, rather than an ordinary ASCII space).
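If an ordinary ASCII space is wanted in the result, the non-breaking space that `&nbsp;` decodes to (U+00A0) can be replaced explicitly. The sketch below uses the standard library's `html.unescape` in place of BeautifulSoup so that it runs without third-party packages. Applying the same `replace` call to the result of `soup.get_text()` should work the same way, though that is an assumption about your setup rather than something shown in the answer.

```python
import html

# Text as it looks after the tags are stripped but before entities are decoded.
raw = "Good Morning,&nbsp;As per last email"

# html.unescape decodes &nbsp; to the non-breaking space U+00A0.
decoded = html.unescape(raw)
print("\xa0" in decoded)  # True

# Swap the non-breaking space for an ordinary space.
normalized = decoded.replace("\xa0", " ")
print(normalized)  # Good Morning, As per last email
```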

[i.e. don't use regex to parse HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — Jared Goguen, Apr 13 '16 at 17:13
