I've been reading many Q&As on how to remove all the HTML code from a string using Python, but none was satisfying. I need a way to remove all the tags, preserve/convert the HTML entities, and work well with UTF-8 strings.
Apparently BeautifulSoup is vulnerable to some specially crafted HTML strings, so I built a simple parser with HTMLParser to get just the text, but I was losing the entities:
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.data = []

    def handle_data(self, data):
        self.data.append(data)

    def handle_charref(self, name):
        self.data.append(name)

    def handle_entityref(self, ent):
        self.data.append(ent)
This gives me something like:
[u'Asia, sp', u'cialiste du voyage ', ...
losing the entity for the accented "e" in spécialiste.
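To make the goal concrete, here is a minimal sketch of the behaviour I am after: the two reference handlers decode what they receive instead of passing it through or dropping it. It assumes the standard-library entity table (`name2codepoint`) is good enough; the names `TextExtractor` and `strip_tags` are hypothetical, and the import fallback is only there so the same code runs on Python 2 and 3. It is not a vetted solution: malformed markup and the contents of `<script>`/`<style>` are not treated specially.

```python
try:  # Python 3
    from html.parser import HTMLParser
    from html.entities import name2codepoint
    unichr = chr
except ImportError:  # Python 2, as in the snippet above
    from HTMLParser import HTMLParser
    from htmlentitydefs import name2codepoint


class TextExtractor(HTMLParser):  # hypothetical name
    def __init__(self):
        HTMLParser.__init__(self)
        # Python 3.4+ can decode references itself; switch that off so
        # the two handlers below are called (harmless on Python 2).
        self.convert_charrefs = False
        self.data = []

    def handle_data(self, data):
        self.data.append(data)

    def handle_charref(self, name):
        # name is "233" for &#233; and "xe9" for &#xe9;
        try:
            if name[0] in 'xX':
                code = int(name[1:], 16)
            else:
                code = int(name)
            self.data.append(unichr(code))
        except ValueError:
            # Code point out of range: keep the reference as written.
            self.data.append(u'&#' + name + u';')

    def handle_entityref(self, name):
        code = name2codepoint.get(name)
        if code is None:
            # Unknown entity name: keep the reference as written.
            self.data.append(u'&' + name + u';')
        else:
            self.data.append(unichr(code))


def strip_tags(html):  # hypothetical helper
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return u''.join(parser.data)
```

With this, `strip_tags(u'<p>Asia, sp&eacute;cialiste du <b>voyage</b></p>')` should give the text with the accented "e" intact rather than split around it.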
As for the many regexps posted as answers to similar questions, any of them will always miss some edge cases that were not considered.
Is there any really good module I could use?