After trying several methods, here is a summary of what worked for me. Below are two ways to avoid or remove \xa0
characters from a parsed HTML string.
Assume our raw HTML is the following, where each &nbsp; entity is a non-breaking space:
raw_html = '<p>Dear Parent,&nbsp;</p><p><span style="font-size: 1rem;">This is a test message,&nbsp;</span><span style="font-size: 1rem;">kindly ignore it.&nbsp;</span></p><p><span style="font-size: 1rem;">Thanks</span></p>'
Let's try to extract the text from this HTML string:
from bs4 import BeautifulSoup
raw_html = '<p>Dear Parent,&nbsp;</p><p><span style="font-size: 1rem;">This is a test message,&nbsp;</span><span style="font-size: 1rem;">kindly ignore it.&nbsp;</span></p><p><span style="font-size: 1rem;">Thanks</span></p>'
text_string = BeautifulSoup(raw_html, "lxml").text
# repr() makes the non-breaking spaces visible as \xa0
print(repr(text_string))
# 'Dear Parent,\xa0This is a test message,\xa0kindly ignore it.\xa0Thanks'
The code above leaves \xa0 characters (non-breaking spaces) in the string. There are two ways to remove them.
Method # 1 (Recommended):
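The first is BeautifulSoup's get_text method with the strip argument set to True. This trims leading and trailing whitespace, including \xa0, from each text fragment and then joins the fragments with no separator.
Our code becomes: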
clean_text = BeautifulSoup(raw_html, "lxml").get_text(strip=True)
print(clean_text)
# Dear Parent,This is a test message,kindly ignore it.Thanks
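Because strip=True joins the fragments with no separator, the words on either side of each removed \xa0 run together, as in the output above. If that is not wanted, get_text also accepts a separator string as its first argument. The following is a minimal sketch, not the exact code from this answer. It assumes the non-breaking spaces come from &nbsp; entities, uses a shortened version of the sample HTML, and uses the built-in "html.parser" instead of "lxml" so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Shortened sample input (an assumption for illustration): each &nbsp;
# is decoded to the character \xa0 by the parser.
raw_html = (
    "<p>Dear Parent,&nbsp;</p>"
    "<p><span>This is a test message,&nbsp;</span>"
    "<span>kindly ignore it.&nbsp;</span></p>"
    "<p><span>Thanks</span></p>"
)

soup = BeautifulSoup(raw_html, "html.parser")

# strip=True trims each text fragment; the first argument is the string
# placed between fragments when they are joined.
spaced = soup.get_text(" ", strip=True)

print(spaced)
# Dear Parent, This is a test message, kindly ignore it. Thanks
```

The separator is inserted between every non-empty text fragment, so markup that splits a word across tags would also gain a space there.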
Method # 2:
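The other option is to use Python's standard library module unicodedata, specifically unicodedata.normalize. The NFKD form replaces each \xa0 with an ordinary space, so the text keeps its word spacing:
import unicodedata
# raw_html and BeautifulSoup are the same as in the first snippet
text_string = BeautifulSoup(raw_html, "lxml").text
clean_text = unicodedata.normalize("NFKD", text_string)
print(clean_text)
# Dear Parent, This is a test message, kindly ignore it. Thanks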
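One thing to keep in mind with this method is that NFKD normalization is not limited to \xa0: it rewrites every character that has a compatibility decomposition, and it also splits accented letters into a base letter plus a combining mark. The sketch below compares it with a plain str.replace, which touches only the non-breaking space. The sample string is made up for illustration and is not from the original example.

```python
import unicodedata

# Made-up sample: superscript two (U+00B2), two non-breaking spaces
# (U+00A0), and the "fi" ligature (U+FB01).
sample = "x\u00b2\xa0=\xa04, \ufb01ne"

# NFKD replaces \xa0 with a space, but also rewrites the superscript
# and the ligature to their plain ASCII equivalents.
normalized = unicodedata.normalize("NFKD", sample)

# A targeted replace changes only the non-breaking spaces.
replaced = sample.replace("\xa0", " ")

print(normalized)  # x2 = 4, fine
print(replaced == "x\u00b2 = 4, \ufb01ne")  # True
```

If only the non-breaking spaces should change, the targeted replace is the more conservative choice. Normalization is convenient when the other compatibility characters should be simplified as well.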
I have also described these methods in more detail on this blog, which you may want to refer to.