Python: unescape special characters without splitting data

Question

I have made a simple HTML parser which is basically a direct copy from the docs. I am having trouble unescaping special characters without also splitting up data into multiple chunks.

Here is my code with a simple example:

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.data = []

    def handle_starttag(self, tag, attrs):
        #print (tag,attrs)
        pass

    def handle_endtag(self, tag):
        #print (tag)
        pass

    def handle_data(self, data):
        self.data.append(data)

    def handle_charref(self, ref):
        self.handle_entityref("#" + ref)

    def handle_entityref(self, ref):
        self.handle_data(self.unescape("&%s;" % ref))



n = "<strong>I &lt;3s U &amp; you luvz me</strong>"


parser = MyHTMLParser()
parser.feed(n)
parser.close()
data = parser.data
print(data)

The issue is that this returns five separate pieces of data:

['I ', u'<', '3s U ', u'&', ' you luvz me']

What I want is a single string:

['I <3s U & you luvz me']

Thanks JP

score 3 · Accepted Answer · answered Jan 02 '14 at 03:58

Join the list of strings using str.join:

>>> ''.join(['I ', u'<', '3s U ', u'&', ' you luvz me'])
u'I <3s U & you luvz me'

Alternatively, you can use external libraries, like lxml:

>>> import lxml.html
>>> n = "<strong>I &lt;3s U &amp; you luvz me</strong>"
>>> root = lxml.html.fromstring(n)
>>> root.text_content()
'I <3s U & you luvz me'

falsetru

Thanks falsetru. Join might work in this simple case, but with more complex html it will be much harder to work out what to join. The reason I haven't used lxml (or ElementTree) is that I don't really know how they work and am trying to learn from first principles before using them. But will look into it more. – jprockbelly Jan 02 '14 at 04:13

score 1 · Answer 2 · answered Jan 02 '14 at 07:57

Remember that the purpose of HTMLParser is to let you build a document tree from an input. If you don't care at all about the document's structure, then the str.join solution @falsetru gives will be fine. You can be certain that all element tags and comments will be filtered out.
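As a rough sketch of that "text only" case on Python 3 (an assumption, since the question's code targets Python 2): the parser can collect the chunks itself and join them on request. TextCollector and its text() method are hypothetical names used for illustration. On Python 3.5 and later, convert_charrefs defaults to True, so character references are already unescaped by the time handle_data sees them.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers text chunks and joins them on request."""

    def __init__(self):
        # convert_charrefs=True is the default on Python 3.5+; it is spelled
        # out here because the unescaping of &lt; and &amp; depends on it.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Only text reaches this method; tags and comments go to
        # other handlers, which are left as no-ops here.
        self.chunks.append(data)

    def text(self):
        # Joining makes the result independent of how the parser
        # happened to split the text into chunks.
        return ''.join(self.chunks)


collector = TextCollector()
collector.feed('<strong>I &lt;3s U &amp; you luvz me</strong>')
collector.close()
print(collector.text())
```

Joining inside the parser keeps the caller from having to know how many chunks were produced.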

However, if you need the structure for more complex scenarios, you have to build a document tree. The handle_starttag and handle_endtag methods exist for this purpose.

First we need a basic tree that can hold some information.

class Element:
    def __init__(self, parent, tag, attrs=None):
        self.parent = parent
        self.tag = tag
        self.children = []
        self.attrs = attrs or []
        self.data = ''

Now make the parser create a new node on every handle_starttag call and move back up the tree on every handle_endtag call. The parsed data is also passed to the current node instead of being held in the parser.

from html import unescape            # Python 3; HTMLParser.unescape was removed in 3.9
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Element(None, '__DOCROOT__') # Special root node for us
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        # Create a child of the current node and descend into it
        newel = Element(self.current, tag, attrs)
        self.current.children.append(newel)
        self.current = newel

    def handle_endtag(self, tag):
        # Move back up to the parent node
        self.current = self.current.parent

    def handle_data(self, data):
        self.current.data += data

    def handle_charref(self, ref): # No changes here
        self.handle_entityref('#' + ref)

    def handle_entityref(self, ref): # Same idea as before, using html.unescape
        self.handle_data(unescape("&%s;" % ref))

Now you can access the tree through the parser's root attribute to get the data from any element you like. For example:

n = '<strong>I &lt;3s U &amp; you luvz me</strong>'
p = MyHTMLParser()
p.feed(n)
p.close()

def print_tree(node, indent=0):
    print('    ' * indent + node.tag)
    print('    ' * indent + '  ' + node.data)
    for c in node.children:
        print_tree(c, indent + 1)

print_tree(p.root)

This will give you

__DOCROOT__

    strong
      I <3s U & you luvz me

If you instead parsed n = '<html><head><title>Test</title></head><body><h1>I &lt;3s U &amp; you luvz me</h1></body></html>', you would get:

__DOCROOT__

    html

        head

            title
              Test
        body

            h1
              I <3s U & you luvz me

Next up is to make the tree building robust and handle cases like mismatched or implicit endtags. You will also want to add some nice find('tag') like methods on Element for traversing the tree. Do it well enough and you'll have made the next BeautifulSoup.
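A minimal sketch of those two next steps, assuming Python 3. The names TolerantTreeParser, VOID_TAGS, find and find_all are hypothetical and chosen for illustration; the void-tag list is deliberately partial. The end-tag handling only covers two cases: unclosed elements are closed when an ancestor's end tag arrives, and stray end tags are ignored. It does not implement HTML's real implicit-end-tag rules (for example, a second li does not close the first).

```python
from html.parser import HTMLParser

# Elements that never take an end tag; a partial list, for illustration only.
VOID_TAGS = {'br', 'hr', 'img', 'input', 'link', 'meta'}


class Element:
    def __init__(self, parent, tag, attrs=None):
        self.parent = parent
        self.tag = tag
        self.children = []
        self.attrs = attrs or []
        self.data = ''

    def find_all(self, tag):
        """Yield every descendant with the given tag, in document order."""
        for child in self.children:
            if child.tag == tag:
                yield child
            yield from child.find_all(tag)

    def find(self, tag):
        """Return the first matching descendant, or None."""
        return next(self.find_all(tag), None)


class TolerantTreeParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Element(None, '__DOCROOT__')
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Element(self.current, tag, attrs)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            # Only descend into elements that can have content.
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag. If there is
        # none, the end tag is stray and is ignored.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.data += data


p = TolerantTreeParser()
# The <p> is never closed and </span> has no matching start tag.
p.feed('<div><p>I &lt;3s U &amp; you<br>luvz me</div></span><h1>Hi</h1>')
p.close()
print([child.tag for child in p.root.children])
print(p.root.find('p').data)
```

Note that each Element still stores its text as one string, so the position of text relative to child elements (here, the br) is not preserved.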

Great thank you, there is a lot here that I don't fully understand but I will read up on it now. I've accepted falsetru's answer as it is best for the question I actually asked, although your answer is better for what I actually wanted to know. — jprockbelly, Jan 03 '14 at 03:11

score 1 · Answer 3 · edited May 23 '17 at 11:52

You can refer to this answer and adapt its html_to_text function to your needs.

from HTMLParser import HTMLParser
n = "<strong>I &lt;3s U &amp; you luvz me</strong>"

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def handle_entityref(self, name):
        self.fed.append('&%s;' % name)
    def get_data(self):
        return ''.join(self.fed)

def html_to_text(html):
    s = MLStripper()
    s.feed(html)
    return HTMLParser().unescape(s.get_data())

print html_to_text(n)

Output:

I <3s U & you luvz me
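The code above is Python 2. A possible Python 3 port, offered as a sketch rather than a drop-in replacement: the module is html.parser, HTMLParser.unescape is gone (removed in Python 3.9) in favour of html.unescape, and convert_charrefs=False is passed so that the entity handlers are still called. The handle_charref method is an addition not present in the answer above; without it, numeric references such as &#60; would be dropped.

```python
from html import unescape
from html.parser import HTMLParser


class MLStripper(HTMLParser):
    def __init__(self):
        # Keep references escaped while parsing so that they can be
        # unescaped in a single pass at the end.
        super().__init__(convert_charrefs=False)
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def handle_entityref(self, name):
        # Named references such as &lt; arrive here as 'lt'.
        self.fed.append('&%s;' % name)

    def handle_charref(self, name):
        # Numeric references such as &#60; or &#x3c; arrive here
        # as '60' or 'x3c' (added for this sketch).
        self.fed.append('&#%s;' % name)

    def get_data(self):
        return ''.join(self.fed)


def html_to_text(html):
    s = MLStripper()
    s.feed(html)
    s.close()
    return unescape(s.get_data())


print(html_to_text("<strong>I &lt;3s U &amp; you luvz me</strong>"))
```

On Python 3.5 and later the default convert_charrefs=True would make the two entity handlers unnecessary; they are kept here only to mirror the structure of the original code.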
