
I have a large HTML source file I would like to parse (~200,000 lines), and I'm fairly certain there is some poor formatting throughout. I've been researching parsers, and it seems Beautiful Soup, lxml, and html5lib are the most popular. From reading this website, it seems lxml is the most commonly used and fastest, while Beautiful Soup is slower but handles errors and variation in markup better.

I'm a little confused by the Beautiful Soup documentation, http://www.crummy.com/software/BeautifulSoup/bs4/doc/, and commands like BeautifulSoup(markup, "lxml") or BeautifulSoup(markup, "html5lib"). In such instances, is it using both Beautiful Soup and html5lib/lxml? Speed is not really an issue here, but accuracy is. The end goal is to get the source code using urllib2 and retrieve all the text data from the file, as if I were to just copy/paste the webpage.

P.S. Is there any way to parse the file without returning any whitespace that was not present in the webpage view?

zhuyxn

1 Answer


My understanding (having used BeautifulSoup for a handful of things) is that it is a wrapper for parsers like lxml or html5lib. Using whichever parser is specified (I believe the default is Python's built-in html.parser), BeautifulSoup creates a tree of tag elements that makes it quite easy to navigate and search the HTML for useful data contained within tags. If you really just need the text from the webpages and not more specific data from specific HTML tags, you might only need a code snippet similar to this:

from bs4 import BeautifulSoup
import urllib2

# Fetch the page and parse it with BeautifulSoup's default parser
soup = BeautifulSoup(urllib2.urlopen("http://www.google.com"))

# Pull all of the text out of the parsed document
print soup.get_text()

get_text isn't that great with complex webpages (it occasionally picks up stray JavaScript or CSS), but once you get the hang of BeautifulSoup, it shouldn't be hard to extract only the text you want, as in the sketch below.
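One common workaround (not from the answer itself, just a pattern sketched here; the URL is a placeholder) is to delete the script and style elements from the tree before extracting text:

from bs4 import BeautifulSoup
import urllib2

soup = BeautifulSoup(urllib2.urlopen("http://www.google.com"))

# Remove <script> and <style> tags so their contents don't
# end up in the extracted text
for tag in soup(["script", "style"]):
    tag.decompose()  # deletes the tag and its contents from the tree

print soup.get_text()

decompose() removes each matching tag and everything inside it, so that text never reaches get_text().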

For your purposes, it seems you don't need to worry about getting one of those other parsers to use with BeautifulSoup (html5lib or lxml). BeautifulSoup can deal with some sloppiness on its own, and if it can't, it will raise an obvious error about "malformed HTML" or something of the sort, which would be an indication to install html5lib or lxml.
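If you do install one of them later, switching only requires naming the parser in the constructor. A minimal sketch, assuming lxml and html5lib are installed (the URL is again a placeholder):

from bs4 import BeautifulSoup
import urllib2

html = urllib2.urlopen("http://www.google.com").read()

# Beautiful Soup provides the navigation API either way; the named
# library only changes how the raw markup is turned into a tree
soup_lxml = BeautifulSoup(html, "lxml")        # fast, fairly lenient
soup_html5 = BeautifulSoup(html, "html5lib")   # parses like a browser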

Paul Whalen
  • Thanks! That's almost exactly the code I have currently, and it hasn't thrown any errors I can see, though I was just wondering if using lxml/html5lib was recommended in any way. Is there any way to eliminate whitespace that was not originally in the webpage (i.e., only between paragraphs or sections of text)? It seems like calling .get_text() returns as much whitespace as the original source code. – zhuyxn Jun 08 '12 at 03:54
  • Whitespace in HTML is by standard collapsed down to a single space, so your original HTML may have a lot of whitespace, but in your browser you'll only see a single space. It is easy to write a whitespace collapser: `import re; from functools import partial; collapse_ws = partial(re.sub, re.compile(r'\s+'), ' ')` Then use `collapse_ws` as a function, like `s = collapse_ws("lsdjf   slkdf  slfkj     lsj")`, and you'll get `'lsdjf slkdf slfkj lsj'`. – PaulMcG Jun 08 '12 at 11:41
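Tying the two comments together, a minimal sketch (the collapse_ws helper comes from the comment above; the URL is just a placeholder):

import re
from functools import partial
import urllib2
from bs4 import BeautifulSoup

# Collapse any run of whitespace into a single space, mimicking
# how a browser renders the page
collapse_ws = partial(re.sub, re.compile(r'\s+'), ' ')

soup = BeautifulSoup(urllib2.urlopen("http://www.google.com"))
print collapse_ws(soup.get_text()).strip()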