Forgive me if this has been asked a billion times, but what are the available options for parsing HTML in Python? Specifically, I'm dealing with some legacy sites whose markup has a lot of errors. Are there any parsers that are really fault tolerant?

  • check http://stackoverflow.com/q/717541/2870069, http://stackoverflow.com/q/6494199/2870069, http://stackoverflow.com/q/11709079/2870069, http://stackoverflow.com/q/13759158/2870069 and others – Jakob Oct 22 '13 at 06:43

1 Answer

In my experience, among the many Python XML/HTML libraries, Beautiful Soup is really good at processing broken HTML.

Raw:

<i>This <span title="a">is<br> some <html>invalid</htl %> HTML. 
<sarcasm>It's so great!</sarcasm>

Parsed with BeautifulSoup:

 <i>This 
  <span title="a">is
   <br /> some 
   <html>invalid HTML. 
    <sarcasm>It's so great!
    </sarcasm>
   </html>
  </span>
 </i>
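
The answer does not show the call that produced this output, so here is a minimal sketch of how it could be reproduced. It assumes the `bs4` package (Beautiful Soup 4) is installed and uses the standard-library `html.parser` backend; the original answer names neither a version nor a parser, and `parse_broken_html` is just an illustrative helper name.

```python
from bs4 import BeautifulSoup

# The broken markup from the "Raw" sample above.
RAW = (
    '<i>This <span title="a">is<br> some <html>invalid</htl %> HTML. \n'
    "<sarcasm>It's so great!</sarcasm>"
)


def parse_broken_html(markup: str) -> BeautifulSoup:
    """Parse possibly malformed HTML (hypothetical helper for this example).

    "html.parser" is an assumption; Beautiful Soup can also use lxml or
    html5lib if they are installed.
    """
    return BeautifulSoup(markup, "html.parser")


soup = parse_broken_html(RAW)

# prettify() returns the repaired tree as an indented string.
print(soup.prettify())

# Elements stay reachable even though the input never closed them properly.
print(soup.find("sarcasm").get_text())
print(soup.find("span")["title"])
```

Different backends repair broken markup in different ways, so the tree you get from `lxml` or `html5lib` may not match the one shown above exactly.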
Leonardo.Z