Extracting Text from HTML markup?

Question

Possible Duplicate:
Extracting text from HTML file using Python
Parsing Source Code (Python) Approach: Beautiful Soup, lxml, html5lib difference?

I currently have a large webpage whose source is ~200,000 lines, almost all (if not all) of it HTML. More specifically, the page's content is a few thousand blocks of text separated by line breaks (though a line break does not necessarily mean a separation in content).

My main objective is to extract text from the source code as if I were copying/pasting the webpage into a text editor. There is another parsing function I would like to use, which originally took in copied/pasted text rather than the source code.

To do this, I'm currently fetching the page with urllib2 and calling .get_text() in Beautiful Soup. The problem is that Beautiful Soup leaves tremendous amounts of white space in the output, which makes it difficult to pass the result into the second "text" parser. I have done quite a bit of research on parsing HTML, but I'm frankly not sure how to solve this problem easily. Furthermore, I'm a bit confused about how to use libraries like lxml to extract text.

tl;dr: Is there any way to achieve the same result as doing Select All, Copy, Paste on a webpage?

If you've got a solution but the only problem is there's too much white space, can't you just remove the extra white space? Try `re.sub(r"\s+", " ", text)`. — Greg Hewgill, Jun 08 '12 at 04:44
--David Thanks for the correction! @GregHewgill That would remove the section spacing present in the original webpage no? Another parsing function I have uses these white spaces in its function as a delimiter of sorts, so I would prefer not to remove them. ): — zhuyxn, Jun 08 '12 at 04:52
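
To make the concern in the comments above concrete: collapsing every run of white space with `\s+` would also flatten the blank lines that separate sections. A minimal sketch of a gentler cleanup follows, assuming the second parser only needs single blank lines kept as delimiters. The function name `normalize_whitespace` is hypothetical, and whether this matches the delimiter rules of the other parser is an assumption.

```python
import re

def normalize_whitespace(text):
    """Collapse spaces/tabs inside lines but keep blank lines as separators."""
    # Normalize line endings so only "\n" remains.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Squeeze horizontal white space within each line and trim the line ends.
    lines = [re.sub(r"[ \t\f\v]+", " ", line).strip() for line in text.split("\n")]
    text = "\n".join(lines)
    # Reduce runs of blank lines to exactly one blank line.
    return re.sub(r"\n{3,}", "\n\n", text).strip()
```

Calling `normalize_whitespace(soup.get_text())` on the Beautiful Soup output would be one way to try it before handing the text to the second parser.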

score 1 · Answer 1 · edited Jun 08 '12 at 07:02

It sounds like you want to render the HTML as text, not extract the content of various tags.

If that's the case, consider running one of these as a subprocess from your Python code:

links -html-numbered-links 1 -html-images 1 -dump "file://$@"
lynx -force_html -dump "$@"
w3m -T text/html -F -dump "$@"
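
As a rough sketch of the "run as a subprocess" part, the commands above could be wrapped like this, assuming Python 3 and that the chosen browser is installed and on the PATH. The names `DUMP_COMMANDS`, `build_dump_command`, and `render_html` are hypothetical; the flags are copied from the commands above, and decoding the output as UTF-8 is an assumption.

```python
import subprocess

# Argument lists mirroring the three shell commands above.
DUMP_COMMANDS = {
    "links": ["links", "-html-numbered-links", "1", "-html-images", "1", "-dump"],
    "lynx": ["lynx", "-force_html", "-dump"],
    "w3m": ["w3m", "-T", "text/html", "-F", "-dump"],
}

def build_dump_command(tool, path):
    """Return the argument list for dumping the HTML file at `path` as text."""
    args = list(DUMP_COMMANDS[tool])
    # The links command above takes a file:// URL; the others take a plain path.
    target = "file://" + path if tool == "links" else path
    return args + [target]

def render_html(tool, path):
    """Run the text browser and return its rendered output as a string."""
    result = subprocess.run(
        build_dump_command(tool, path),
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError if the browser exits non-zero
    )
    # Assumption: the browser writes UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

For example, `text = render_html("lynx", "/tmp/page.html")` would return the rendered page text. For links, pass an absolute path so the `file://` URL is well formed.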

edited Jun 08 '12 at 07:02

David Cain

answered Jun 08 '12 at 05:01

user1277476

score 0 · Answer 2 · answered Jun 08 '12 at 06:05

Have you tried looking into an HTML parser? If you just want the meat of the HTML page without the tag notation, you can use:

try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        # Stacks of the currently open tags and their attributes.
        self.tags = []
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        self.attrs.append(attrs)

    def handle_endtag(self, tag):
        if tag not in self.tags:
            return
        # Pop back to (and including) the matching start tag.
        while self.tags:
            self.attrs.pop()
            if self.tags.pop() == tag:
                return

    def handle_data(self, data):
        # Called for every run of text between tags.
        print(data)

parser = MyHTMLParser()
with open("temp.html") as f:
    parser.feed(f.read())

This will print the text data inside the HTML page. For example, <div><h1>This is my webpage</h1><div></div></div> will be printed as This is my webpage. You can modify any of the methods to show different sections, different formats, and so on. Just change the basic class to your liking; my code should get you started on the right path.
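
As one possible modification along those lines, here is a minimal sketch that collects the text into a string instead of printing it, and skips the contents of script and style tags. It assumes Python 3 (`html.parser`); the names `TextCollector` and `html_to_text` are hypothetical, and joining chunks with newlines is just one choice of separator.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect visible text chunks, ignoring <script> and <style> contents."""
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def html_to_text(markup):
    """Return the text of `markup`, one chunk per line."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return "\n".join(parser.chunks)
```

With the example markup above, `html_to_text("<div><h1>This is my webpage</h1><div></div></div>")` gives `"This is my webpage"`. Note that this drops the page's original spacing, so it will not reproduce a browser's copy/paste layout.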
