Robustly Parsing HTML in Python

Question

Forgive me if this has been asked a billion times, but what are the available options for parsing HTML in Python? Specifically, I'm dealing with some legacy sites that have a lot of errors. Are there any parsers that are really fault tolerant?

check http://stackoverflow.com/q/717541/2870069, http://stackoverflow.com/q/6494199/2870069, http://stackoverflow.com/q/11709079/2870069, http://stackoverflow.com/q/13759158/2870069 and others — Jakob, Oct 22 '13 at 06:43

score 3 · Accepted Answer · answered Oct 22 '13 at 05:27


In my experience, among the many Python XML/HTML libraries, Beautiful Soup is really good at processing broken HTML.

Raw:

<i>This <span title="a">is<br> some <html>invalid</htl %> HTML. 
<sarcasm>It's so great!</sarcasm>

Parsed with BeautifulSoup:

 <i>This 
  <span title="a">is
   <br /> some 
   <html>invalid HTML. 
    <sarcasm>It's so great!
    </sarcasm>
   </html>
  </span>
 </i>
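
For anyone who wants to try this, here is a minimal sketch of how the raw snippet above could be fed to Beautiful Soup. It assumes the current `beautifulsoup4` package (imported as `bs4`) and the standard-library `"html.parser"` backend; neither is specified in the answer, and the helper name `parse_broken_html` is only illustrative. The exact tree and indentation can differ from the output shown above depending on the library version and the parser backend chosen.

```python
from bs4 import BeautifulSoup  # assumption: pip install beautifulsoup4

# The broken markup from the "Raw" example above.
RAW = """<i>This <span title="a">is<br> some <html>invalid</htl %> HTML.
<sarcasm>It's so great!</sarcasm>"""


def parse_broken_html(markup):
    """Parse possibly malformed HTML into a navigable tree.

    Hypothetical helper for illustration. "html.parser" is the backend
    built on the standard library; "lxml" or "html5lib" can be passed
    instead if those packages are installed.
    """
    return BeautifulSoup(markup, "html.parser")


soup = parse_broken_html(RAW)

# Print the repaired tree with unclosed tags closed.
print(soup.prettify())

# The tree can be queried even though the input was invalid.
print(soup.span["title"])
print(soup.find("sarcasm").get_text())
```

Unknown tags such as `<sarcasm>` are kept as ordinary elements, and the stray `</htl %>` end tag is simply dropped because no matching element is open.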

answered Oct 22 '13 at 05:27

Leonardo.Z


Awesome, this looks like it will do the trick. – user2905592 Oct 22 '13 at 20:32
