The lxml.html module is worth considering. However, it takes a bit of massaging to remove the CSS and JavaScript:
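
    from lxml import html

    def stripsource(page):
        """Yield the non-blank text fragments of an HTML page."""
        source = html.fromstring(page)
        # Drop CSS, JavaScript and comments. drop_tree() keeps the text that
        # follows the removed node; getparent().remove() would discard it.
        for item in source.xpath("//style|//script|//comment()"):
            # Comments outside the root element have no parent; skip them.
            if item.getparent() is not None:
                item.drop_tree()
        for line in source.itertext():
            if line.strip():
                yield line
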
The yielded lines can simply be concatenated, but that can lose significant
word boundaries when there is no whitespace around tags that render as
breaks, such as <p>, <br> or <div>.
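One way around that is to join the fragments with a space and then collapse runs of whitespace. This is only a minimal sketch: `join_fragments` is a hypothetical helper name (not part of lxml), and it accepts any iterable of strings, such as the output of the generator above.

```python
import re

def join_fragments(fragments):
    # Join with a space so text from adjacent tags does not run together,
    # then collapse every run of whitespace into a single space.
    return re.sub(r"\s+", " ", " ".join(fragments)).strip()

print(join_fragments(["Title", "First paragraph.\n", "  Second one."]))
# prints: Title First paragraph. Second one.
```

The trade-off is that a space is also inserted between inline fragments, so markup like `<b>foo</b>bar` comes out as "foo bar" rather than "foobar".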
You might also want to iterate over just the <body> element, depending on
your requirements.
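Restricting the walk to <body> might look roughly like this. It is a sketch, not a drop-in replacement: `body_fragments` is a made-up name, the fallback to the parsed root for input without a <body> is my own assumption, and the style/script/comment removal shown earlier is left out for brevity.

```python
from lxml import html

def body_fragments(page):
    root = html.fromstring(page)
    # find() returns None when there is no <body> child (e.g. a fragment),
    # in which case we fall back to whatever element was parsed.
    body = root.find("body")
    if body is None:
        body = root
    for text in body.itertext():
        if text.strip():
            yield text

page = "<html><head><title>Skip me</title></head><body><p>Keep me</p></body></html>"
print(list(body_fragments(page)))
# prints: ['Keep me']
```

This keeps text from <head>, such as the page title, out of the result.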