In response to the question "Python regular expression", I tried to implement an HTML parser using HTMLParser:
import HTMLParser


class ExtractHeadings(HTMLParser.HTMLParser):

    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.text = None      # text of the heading being read, or None outside a heading
        self.headings = []    # collected heading texts

    def is_relevant(self, tagname):
        return tagname == 'h1' or tagname == 'h2'

    def handle_starttag(self, tag, attrs):
        if self.is_relevant(tag):
            self.text = ''

    def handle_endtag(self, tag):
        if self.is_relevant(tag):
            self.headings.append(self.text)
            self.text = None

    def handle_data(self, data):
        if self.text is not None:
            self.text += data

    def handle_charref(self, name):
        # name is e.g. 'x6c' (hexadecimal) or '72' (decimal)
        if self.text is not None:
            if name[0] == 'x':
                self.text += chr(int(name[1:], 16))
            else:
                self.text += chr(int(name))

    def handle_entityref(self, name):
        if self.text is not None:
            print 'TODO: entity %s' % name


def extract_headings(text):
    parser = ExtractHeadings()
    parser.feed(text)
    return parser.headings


print extract_headings('abdk3<h1>The content we need</h1>aaaaabbb<h2>The content we need2</h2>')
print extract_headings('before<h1>Hello</h1>after')

While doing that, I wondered whether the API of this module is badly designed or whether I overlooked something important. My questions are:
- Why does my implementation of handle_charref have to be that complex? I would have expected a good API to pass the code point as a parameter, not either x6c or 72 as a string.
- Why doesn't the default implementation of handle_charref call handle_data with an appropriate string?
- Why is there no utility implementation of handle_entityref that I could just call? It could be named handle_entityref_HTML4; it would look up the entities defined in HTML 4 and then call handle_data on them.
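To make the three points above concrete, here is a rough sketch of the kind of base class I have in mind. It is only an illustration under some assumptions: it is written for Python 3, where the module is html.parser and the HTML 4 entity table is html.entities.name2codepoint (in Python 2 the corresponding imports would be HTMLParser and htmlentitydefs). The names CharrefDecodingParser, HeadingExtractor and extract_headings_sketch are made up for this example and are not part of the library.

```python
from html.parser import HTMLParser
from html.entities import name2codepoint


class CharrefDecodingParser(HTMLParser):
    """Hypothetical base class: decodes references and forwards them to handle_data."""

    def __init__(self):
        # In Python 3, convert_charrefs=False is needed so that
        # handle_charref and handle_entityref are called at all.
        HTMLParser.__init__(self, convert_charrefs=False)

    def handle_charref(self, name):
        # name is e.g. 'x6c' (hexadecimal) or '72' (decimal).
        if name[0] in 'xX':
            codepoint = int(name[1:], 16)
        else:
            codepoint = int(name)
        self.handle_data(chr(codepoint))

    def handle_entityref(self, name):
        # Look the name up in the HTML 4 entity table.
        if name in name2codepoint:
            self.handle_data(chr(name2codepoint[name]))
        else:
            # Unknown entity: pass it through unchanged (a choice made for this sketch).
            self.handle_data('&%s;' % name)


class HeadingExtractor(CharrefDecodingParser):
    """The heading extractor from above, reduced to tag and data handling."""

    def __init__(self):
        CharrefDecodingParser.__init__(self)
        self.text = None
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag in ('h1', 'h2'):
            self.text = ''

    def handle_endtag(self, tag):
        if tag in ('h1', 'h2'):
            self.headings.append(self.text)
            self.text = None

    def handle_data(self, data):
        if self.text is not None:
            self.text += data


def extract_headings_sketch(text):
    parser = HeadingExtractor()
    parser.feed(text)
    parser.close()
    return parser.headings
```

With such a base class, the subclass only has to deal with tags and text; for example, extract_headings_sketch('<h1>caf&eacute; &#x6c;&#72;</h1>') would collect the heading with all three references already decoded.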
If that API were provided, writing custom HTML parsers would be much easier. So where is my misunderstanding?