BeautifulSoup fails to parse long view state

Question

I am trying to use BeautifulSoup4 to parse the HTML retrieved from http://exporter.nih.gov/ExPORTER_Catalog.aspx?index=0. If I print the resulting soup, it ends like this:

kZXI9IjAi"/></form></body></html>

Searching for the last characters, 9IjAi, in the raw HTML, I found that they sit in the middle of a huge view state. BeautifulSoup seems to have a problem with this. Any hints on what I might be doing wrong, or on how to parse such a page?

Martijn Pieters · Accepted Answer · 2013-08-09T16:06:07.930

BeautifulSoup uses a pluggable HTML parser to build the 'soup'; you need to try out different parsers, as each will treat a broken page differently.

I had no problems parsing that page with any of the parsers, however:

>>> from bs4 import BeautifulSoup
>>> import requests
>>> r = requests.get('http://exporter.nih.gov/ExPORTER_Catalog.aspx?index=0')
>>> # Show the last 60 characters of the document as rendered by each parser
>>> for parser in ('html.parser', 'lxml', 'html5lib'):
...     print(repr(str(BeautifulSoup(r.text, parser))[-60:]))
... 
';\r\npageTracker._trackPageview();\r\n</script>\n</body>\n</html>\n'
'();\r\npageTracker._trackPageview();\r\n</script>\n</body></html>'
'();\npageTracker._trackPageview();\n</script>\n\n\n</body></html>'

Make sure you have the latest BeautifulSoup4 package installed; I have seen consistent problems in the 4.1 series that were solved in 4.2.
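
To see which BeautifulSoup4 version and which parsers are in play, something like the following sketch may help. The helper name `available_parsers` is made up for illustration; it simply probes each parser name and skips the ones whose backing library is not installed.

```python
import bs4
from bs4 import BeautifulSoup


def available_parsers(candidates=("html.parser", "lxml", "html5lib")):
    """Return the candidate parser names that BeautifulSoup can load here.

    Hypothetical helper, not part of bs4 itself.
    """
    usable = []
    for name in candidates:
        try:
            # Parsing a tiny document is enough to find out whether the
            # parser library behind this name is installed.
            BeautifulSoup("<p>probe</p>", name)
        except bs4.FeatureNotFound:
            continue
        usable.append(name)
    return usable


print("bs4 version:", bs4.__version__)
print("usable parsers:", available_parsers())
```

If the reported version is in the 4.1 series, upgrading the package is the first thing to try.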

edited Aug 09 '13 at 16:06

answered Aug 09 '13 at 15:54


Just wondering, is r.text defined elsewhere, or is it a keyword? – sihrc Aug 09 '13 at 15:59
@sihrc: fleshed out my sample; `r` is a `requests` library response. – Martijn Pieters Aug 09 '13 at 16:00
I see. Would it be requests.get, or is requests callable? – sihrc Aug 09 '13 at 16:03
I missed `.get()`, as I typed that out by hand instead of searching my session to copy it. Oops, corrected. – Martijn Pieters Aug 09 '13 at 16:05
In my case there seems to be a problem with lxml. html.parser works just fine. – Achim Aug 14 '13 at 22:05
@Achim: Yes, there are problems with specific versions of the `libxml2` library, I believe. – Martijn Pieters Aug 14 '13 at 22:07
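
Since the outcome can depend on which parser (and which `libxml2` build) is installed, one way to cope is to try the parsers in turn and keep the first result that still contains text known to appear near the end of the page. This is only a sketch: `parse_until_complete` and the `marker` argument are hypothetical names, and the sample document is a stand-in for the real page, not a copy of it.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_until_complete(html, marker,
                         parsers=("lxml", "html.parser", "html5lib")):
    """Return (parser_name, soup) for the first parser that keeps `marker`.

    `marker` is a string the caller expects near the end of the page;
    if it is missing from the rendered soup, the parse was likely cut short.
    """
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            continue  # parser library not installed; try the next one
        if marker in str(soup):
            return name, soup
    raise ValueError("no installed parser preserved the marker %r" % marker)


# Stand-in page: a very long hidden field followed by trailing content.
sample = (
    '<html><body><form>'
    '<input type="hidden" name="__VIEWSTATE" value="' + "A" * 100000 + '"/>'
    '</form><p id="tail">end of page</p></body></html>'
)

name, soup = parse_until_complete(sample, "end of page")
print(name, soup.find(id="tail").get_text())
```

For the page in the question, a marker such as `pageTracker._trackPageview` (visible in the output above) would be a reasonable choice.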