Remember that the purpose of HTMLParser is to let you build a document tree from the input. If you don't care at all about the document's structure, then the str.join solution @falsetru gives will be fine. You can be certain that all element tags and comments will be filtered out.
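For reference, a text-only extractor along those lines might look like the sketch below. This is not necessarily @falsetru's exact code; TextExtractor, chunks and text() are names made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text chunks and ignores tags and comments (hypothetical name)."""

    def __init__(self):
        # convert_charrefs=True (the default since Python 3.5) means references
        # such as &amp; arrive in handle_data already decoded.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags
        self.chunks.append(data)

    def text(self):
        # The str.join step: glue all collected chunks together
        return ''.join(self.chunks)


extractor = TextExtractor()
extractor.feed('<p>Hello <!-- hidden --><b>world</b> &amp; friends</p>')
extractor.close()
print(extractor.text())  # Hello world & friends
```

Everything about nesting is thrown away here, which is exactly why it stops being enough once you need to know which element a piece of text belongs to.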
However, if you do need the structure for more complex scenarios, then you have to build a document tree. The handle_starttag and handle_endtag methods are there for this.
First we need a basic tree that can hold some information.
class Element:
    def __init__(self, parent, tag, attrs=None):
        self.parent = parent
        self.tag = tag
        self.children = []
        self.attrs = attrs or []
        self.data = ''
Now you need to make the HTMLParser create a new node on every handle_starttag and move up the tree on every handle_endtag. We also pass the parsed data to the current node instead of holding it in the parser.
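from html import unescape
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Element(None, '__DOCROOT__')  # Special root node for us
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        # Create a child of the current node and descend into it
        newel = Element(self.current, tag, attrs)
        self.current.children.append(newel)
        self.current = newel

    def handle_endtag(self, tag):
        # Move back up to the parent; stay at the root on a stray end tag
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.data += data

    # The two handlers below are only called when convert_charrefs=False.
    # With the default (True since Python 3.5) references reach handle_data
    # already decoded.
    def handle_charref(self, ref):  # No changes here
        self.handle_entityref('#' + ref)

    def handle_entityref(self, ref):  # No changes here either
        # HTMLParser.unescape() was removed in Python 3.9; use html.unescape()
        self.handle_data(unescape("&%s;" % ref))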
Now you can access the tree through the root attribute of your MyHTMLParser instance to get the data from any element you like. For example:
n = '<strong>I <3s U & you luvz me</strong>'
p = MyHTMLParser()
p.feed(n)
p.close()

def print_tree(node, indent=0):
    print(' ' * indent + node.tag)
    if node.data:  # skip nodes that hold no text
        print(' ' * indent + ' ' + node.data)
    for c in node.children:
        print_tree(c, indent + 1)

print_tree(p.root)
This will give you:
__DOCROOT__
 strong
  I <3s U & you luvz me
If instead you parsed n = '<html><head><title>Test</title></head><body><h1>I <3s U & you luvz me</h1></body></html>', you would get:
__DOCROOT__
 html
  head
   title
    Test
  body
   h1
    I <3s U & you luvz me
Next up is to make the tree building robust so that it handles cases like mismatched or implicit end tags. You will also want to add some nice find('tag')-like methods on Element for traversing the tree. Do it well enough and you'll have made the next BeautifulSoup.
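As a rough starting point for those next steps, here is one way it could be sketched. All names here (Node, TolerantParser, VOID_TAGS, find, find_all) are made up for illustration, and the void-tag list is deliberately partial. The sketch only covers three cases: void elements such as br, stray end tags that match nothing, and elements left open when an ancestor is closed. It does not implement HTML's rules for start tags that implicitly close a sibling, such as a new li closing the previous one.

```python
from html.parser import HTMLParser

# Elements that never have an end tag in HTML (a partial, illustrative list).
VOID_TAGS = {'br', 'hr', 'img', 'input', 'link', 'meta'}


class Node:
    """Same shape as Element above, plus two hypothetical lookup helpers."""

    def __init__(self, parent, tag, attrs=None):
        self.parent = parent
        self.tag = tag
        self.children = []
        self.attrs = attrs or []
        self.data = ''

    def find_all(self, tag):
        """Yield every descendant with the given tag, depth first."""
        for child in self.children:
            if child.tag == tag:
                yield child
            yield from child.find_all(tag)

    def find(self, tag):
        """Return the first matching descendant, or None."""
        return next(self.find_all(tag), None)


class TolerantParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node(None, '__DOCROOT__')
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(self.current, tag, attrs)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void elements never become the current node
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag; anything left
        # open inside it is closed implicitly.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:  # ignore end tags that match nothing
            self.current = node.parent

    def handle_data(self, data):
        self.current.data += data


parser = TolerantParser()
# <p> is never closed, <br> is void, and </b> matches no open element.
parser.feed('<div><p>one<br>two</div><span>after</b></span>')
parser.close()
print([child.tag for child in parser.root.children])  # ['div', 'span']
print(parser.root.find('p').data)                     # onetwo
```

Because handle_endtag searches upward instead of blindly moving to the parent, the unclosed p does not swallow the span that follows, and the stray </b> leaves the tree untouched.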