0

I have a string and a regular expression that matches portions of the string. I want to return a string representing what's left of the original string after all matches have been removed.

import re

string = '<font size="2px" face="Tahoma"><br>Good Morning,&nbsp;</font><div><br></div><div>As per last email'

pattern = r'<[a-zA-Z0-9 ="/\-:;.]*>'

re.findall(pattern, string)

['<font size="2px" face="Tahoma">',
 '<br>',
 '</font>',
 '<div>',
 '<br>',
 '</div>',
 '<div>']

desired_string = "Good Morning,&nbsp;As per last email"
ADJ

2 Answers

3

Instead of re.findall, use re.sub to replace each match with an empty string.

re.sub(pattern, "", string)
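For completeness, here is a minimal runnable sketch that uses the pattern and string from the question. The name `tag_re` is only illustrative, and compiling the pattern is optional; it mainly helps if you apply it to many strings.

```python
import re

# Single quotes on the outside so the double quotes in the HTML need no escaping.
string = '<font size="2px" face="Tahoma"><br>Good Morning,&nbsp;</font><div><br></div><div>As per last email'

pattern = r'<[a-zA-Z0-9 ="/\-:;.]*>'

# tag_re is a hypothetical name; re.sub(pattern, "", string) works the same way.
tag_re = re.compile(pattern)

# Replace every match with the empty string, keeping everything else.
desired_string = tag_re.sub("", string)

print(desired_string)
```

Note that the `&nbsp;` entity is left untouched, since the pattern only matches tags.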

While that's the literal answer to your general question about removing patterns from a string, it appears that your specific problem is related to manipulating HTML. It's generally a bad idea to try to manipulate HTML with regular expressions. For more information see this answer to a similar question: https://stackoverflow.com/a/1732454/7432

Bryan Oakley
1

Instead of a regular expression, use an HTML parser like BeautifulSoup. It looks like you are trying to strip the HTML elements and get the underlying text.

from bs4 import BeautifulSoup

string="""<font size="2px" face="Tahoma"><br>Good Morning,&nbsp;</font><div><br></div><div>As per last email"""

soup = BeautifulSoup(string, 'lxml')

print(soup.get_text())

This outputs:

Good Morning, As per last email

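One thing to notice is that the &nbsp; entity is decoded by this method. The result is a non-breaking space character (U+00A0, '\xa0'), which looks like a regular space when printed but does not compare equal to ' '. If you need ordinary spaces, follow up with soup.get_text().replace('\xa0', ' ').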
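If installing BeautifulSoup and lxml is not an option, roughly the same idea can be sketched with the standard library's `html.parser` module. This is a different technique from the one above, not what BeautifulSoup does internally; `TextExtractor` and `strip_tags` are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes and ignores all tags."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.parts.append(data)


def strip_tags(html_text):
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered at the end of the input
    return "".join(parser.parts)


string = """<font size="2px" face="Tahoma"><br>Good Morning,&nbsp;</font><div><br></div><div>As per last email"""

text = strip_tags(string)
print(repr(text))                 # repr makes the non-breaking space visible
print(text.replace("\xa0", " "))  # optional: swap in ordinary spaces
```

Here too the `&nbsp;` entity comes back as `'\xa0'`, so the final `replace` is only needed if you want plain spaces.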

Andy
  • [i.e. don't use regex to parse HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Jared Goguen Apr 13 '16 at 17:13