
I've been reading many Q&As on how to remove all the HTML markup from a string using Python, but none was satisfying. I need a way to remove all the tags, preserve/convert the HTML entities, and work well with UTF-8 strings.

Apparently BeautifulSoup is vulnerable to some specially crafted HTML strings, so I built a simple parser with HTMLParser to extract just the text, but I was losing the entities:

from HTMLParser import HTMLParser  # Python 2; in Python 3 this module is html.parser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.data = []

    def handle_data(self, data):
        # Plain text between tags
        self.data.append(data)

    def handle_charref(self, name):
        # Numeric references like &#233; -- `name` is just "233"
        self.data.append(name)

    def handle_entityref(self, ent):
        # Named entities like &eacute; -- `ent` is just "eacute"
        self.data.append(ent)

which gives me something like:

[u'Asia, sp', u'cialiste du voyage ', ...

losing the entity for the accented "e" in spécialiste.
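For comparison, in Python 3 the stdlib parser can decode both named and numeric references itself via `convert_charrefs=True` (the default since 3.5), so `handle_data` already receives the decoded text. A minimal sketch (`TextExtractor` is an illustrative name, not from the original code):

```python
from html.parser import HTMLParser  # Python 3 counterpart of the HTMLParser module

class TextExtractor(HTMLParser):
    """Collect only the text content, with entities already decoded."""

    def __init__(self):
        # convert_charrefs=True makes the parser turn &eacute; / &#233;
        # into the character 'é' before calling handle_data
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def text(self):
        return ''.join(self.parts)

p = TextExtractor()
p.feed('<p>Asia, sp&eacute;cialiste du voyage</p>')
print(p.text())  # Asia, spécialiste du voyage
```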

Any of the many regexps you can find as answers to similar questions will always miss some edge case that was not considered.

Is there any really good module I could use?

Arjuna Del Toso

2 Answers


bleach is excellent for this task. It does everything you need. It has an extensive test suite that checks for strange edge cases where tags could slip through. I have never had an issue with it.

Tim Heap
  • bleach.clean('is not allowed', strip=True) this might be exactly what I need, I'll do some tests with utf-8, html entities and that stuff tonight and then let you know, thanks – Arjuna Del Toso Apr 09 '13 at 10:36
  • Bleach may not transform HTML entities into their real UTF-8 counterparts. If it does not, try this question: http://stackoverflow.com/questions/57708/convert-xml-html-entities-into-unicode-string-in-python – Tim Heap Apr 10 '13 at 00:15
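If bleach does leave the entities escaped, the conversion step from that linked question is a one-liner in the Python 3 stdlib; a minimal sketch using `html.unescape` (assuming Python 3.4+):

```python
import html

# html.unescape decodes both named and numeric character references
print(html.unescape('Asia, sp&eacute;cialiste du voyage'))  # Asia, spécialiste du voyage
print(html.unescape('caf&#233;'))  # café
```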

Maybe pyquery? Try `easy_install pyquery` or `pip install pyquery`, then some code like:

from pyquery import PyQuery as jQ

dom = jQ("<html>...</html>")
print dom("body").text()  # text of <body> with all tags removed
pinkdawn