Forgive me if this has been asked a billion times: what are the available options for parsing HTML in Python? Specifically, I'm dealing with some legacy sites that have a lot of markup errors. Are there any parsers that are really fault tolerant?
See http://stackoverflow.com/q/717541/2870069, http://stackoverflow.com/q/6494199/2870069, http://stackoverflow.com/q/11709079/2870069, http://stackoverflow.com/q/13759158/2870069 and others. – Jakob Oct 22 '13 at 06:43
1 Answer
In my experience, among the many Python XML/HTML libraries, Beautiful Soup is really good at processing broken HTML.
Raw:
<i>This <span title="a">is<br> some <html>invalid</htl %> HTML.
<sarcasm>It's so great!</sarcasm>
Parsed with BeautifulSoup:
<i>This
<span title="a">is
<br /> some
<html>invalid HTML.
<sarcasm>It's so great!
</sarcasm>
</html>
</span>
</i>
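
For anyone who wants to try this, here is a minimal sketch of how the raw snippet above could be fed to Beautiful Soup. It assumes the current `bs4` package (`pip install beautifulsoup4`) and the `html.parser` builder that ships with Python; neither is stated in the answer, and the variable names are just for illustration. The exact tree you get back depends on the parser you pick, so the output may not match the listing above line for line.

```python
from bs4 import BeautifulSoup

# The broken markup from the example above: unclosed <i> and <span>,
# a stray <html> in the middle, and a mangled closing tag.
RAW = (
    '<i>This <span title="a">is<br> some <html>invalid</htl %> HTML.\n'
    "<sarcasm>It's so great!</sarcasm>"
)

# Assumption: "html.parser" is used because it needs no extra install.
# "lxml" and "html5lib" are alternative builders that repair markup differently.
soup = BeautifulSoup(RAW, "html.parser")

# Print the repaired tree with every open tag closed.
print(soup.prettify())

# Despite the errors, the document can still be navigated as usual.
sarcasm_text = soup.find("sarcasm").get_text()
span_title = soup.find("span")["title"]
print(sarcasm_text)
print(span_title)
```

If the result looks wrong for a particular legacy page, swapping the second argument for `"lxml"` or `"html5lib"` (both need a separate install) is worth a try, since each parser recovers from errors in its own way.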

Leonardo.Z