
I am using BeautifulSoup to parse a bunch of possibly very dirty HTML documents. I stumbled upon a very bizarre thing.

The HTML comes from this page: http://www.wvdnr.gov/

It contains multiple errors, such as multiple <html></html> elements, a <title> outside the <head>, and so on.

However, html5lib usually works well even in these cases. In fact, when I do:

soup = BeautifulSoup(document, "html5lib")

and pretty-print soup, I see the following output: http://pastebin.com/8BKapx88

which contains a lot of <a> tags.

However, when I do soup.find_all("a"), I get an empty list. I get the same result with lxml.

So: has anybody stumbled on this problem before? What is going on? How do I get the links that html5lib found but isn't returning with find_all?
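
For reference, here is a condensed version of what I am running (I fetch the page with requests, though the fetching method should not matter):

import requests
from bs4 import BeautifulSoup

# Fetch the problematic page and parse it with html5lib.
document = requests.get("http://www.wvdnr.gov/").content
soup = BeautifulSoup(document, "html5lib")

print(len(soup.find_all("a")))  # prints 0, even though the pretty-printed soup is full of <a> tags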


2 Answers


Even though the correct answer is "use another parser" (thanks, @alecxe), I have another workaround. For some reason, this works too:

soup = BeautifulSoup(document, "html5lib")
soup = BeautifulSoup(soup.prettify(), "html5lib")
print(soup.find_all('a'))

which returns the same list of links as:

soup = BeautifulSoup(document, "html.parser")

When it comes to parsing not-well-formed, tricky HTML, the choice of parser is very important. As the BeautifulSoup documentation puts it:

There are also differences between HTML parsers. If you give Beautiful Soup a perfectly-formed HTML document, these differences won’t matter. One parser will be faster than another, but they’ll all give you a data structure that looks exactly like the original HTML document.

But if the document is not perfectly-formed, different parsers will give different results.
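
As a quick illustration (my sketch, not part of the original answer), the classic <a></p> snippet from the docs produces three different trees; lxml and html5lib need to be installed for this to run, and the commented outputs are roughly what each parser produces:

from bs4 import BeautifulSoup

snippet = "<a></p>"  # deliberately malformed: a stray </p> inside an <a>

for parser in ("html.parser", "lxml", "html5lib"):
    # Each parser repairs the broken markup differently, so the trees differ.
    print(parser, BeautifulSoup(snippet, parser))

# Roughly: html.parser keeps just <a></a>; lxml ignores the stray </p> and wraps
# the result in <html><body>; html5lib also adds <head> and pairs the stray </p>
# with an empty <p> element inside the <a>, matching what browsers do.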

html.parser worked for me:

from bs4 import BeautifulSoup
import requests

document = requests.get('http://www.wvdnr.gov/').content
soup = BeautifulSoup(document, "html.parser")
print(soup.find_all('a'))

Demo:

>>> from bs4 import BeautifulSoup
>>> import requests
>>> document = requests.get('http://www.wvdnr.gov/').content
>>>
>>> soup = BeautifulSoup(document, "html5lib")
>>> len(soup.find_all('a'))
0
>>> soup = BeautifulSoup(document, "lxml")
>>> len(soup.find_all('a'))
0
>>> soup = BeautifulSoup(document, "html.parser")
>>> len(soup.find_all('a'))
147


  • Very interesting. I was assuming html5lib and lxml were better than html.parser. Oh well, now I know better. Thanks! – Mikk Nov 12 '14 at 21:17
  • html5lib should match what the HTML spec does and what web browsers do. If it doesn't, something is wrong. – gsnedders Dec 03 '14 at 11:29
  • Hmm — html5lib seems to find 147 `a` elements on its own, yet within BeautifulSoup it finds none. That suggests the problem is on the BeautifulSoup side rather than in html5lib. – gsnedders Dec 03 '14 at 11:33
  • I've reported this bug in BS4 [here](https://bugs.launchpad.net/beautifulsoup/+bug/1450884), FWIW. – gsnedders May 01 '15 at 19:09