
I've done this:

>>> from bs4 import BeautifulSoup; import urllib2
>>> data = urllib2.urlopen('http://api.stackexchange.com/docs/').read()
>>> soup = BeautifulSoup(data.replace('""', '"'))  # the page's markup contains errors
>>> print(soup.prettify())
<!DOCTYPE HTML>
<html lang="en">
............... # cut short
</html>

data looks correct and contains what I expect. The problem is with the output of soup.

soup.prettify() does not return a string containing everything in data. It seems that BeautifulSoup is not parsing the entire string.

If you compare the source of the webpage with the output of soup.prettify(), you'll see that they don't match.
What is happening here, and why?

I have a feeling this post is not very clear. If so, please comment and I'll try to elaborate. Otherwise, feel free to remove this sentence.


Update
In reply to a comment by FakeRainBrigand: the problem persists even when I save the HTML from the browser and parse the saved file. So this has the same problem:

 data = open('Stack Exchange API.htm').read()
pradyunsg
  • It appears that SO is sending a different page to different user agents. It's not BS's fault; the problem is in line 2 where you download the page using urllib2. See this answer which explains [changing the user agent](http://stackoverflow.com/q/802134/1074592) (which is allowed in this case due to the lack of a robots.txt file). – Brigand Mar 16 '13 at 12:08
  • Well, even when I save the HTML of the page and use that, the problem persists. Updating. – pradyunsg Mar 16 '13 at 12:14
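
The user-agent suggestion in the first comment could look roughly like the sketch below. It is written for Python 3, where urllib2 became urllib.request. The build_request helper and the BROWSER_UA string are assumptions made for illustration, not something taken from the linked answer.

```python
import urllib.request

# Hypothetical browser-like User-Agent string; any realistic value would do.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64)"


def build_request(url, user_agent=BROWSER_UA):
    """Build a Request object that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_request("http://api.stackexchange.com/docs/")

# The actual download is left commented out so the sketch runs offline:
# data = urllib.request.urlopen(req).read()
```

Whether this changes the page that comes back depends on the server. As the follow-up comment notes, the parsing problem also showed up with a copy of the page saved from the browser.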

1 Answer


You haven't installed any HTML parser other than the one built into Python (which is really not great). Run

pip install lxml

and reload everything; BeautifulSoup will then automatically pick up lxml and use it to parse the HTML. That just works for me, without the quotation-mark rewriting (the need for which is a sign that something is afoot).

(The HTML is probably broken in some weird way, and Python's HTML parser isn't great at understanding that sort of thing, even with bs4's help.)
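
Rather than relying on automatic detection, the parser can also be named explicitly as the second argument to BeautifulSoup. The following is a minimal sketch in Python 3. The pick_parser and make_soup helpers and the preference order are assumptions for illustration; only the parser names "lxml", "html5lib" and "html.parser" come from bs4 itself.

```python
from importlib.util import find_spec

# Third-party parsers to try first; this ordering is an assumption of the sketch.
PREFERRED_PARSERS = ("lxml", "html5lib")


def pick_parser():
    """Return the first installed third-party parser, else Python's built-in one."""
    for name in PREFERRED_PARSERS:
        if find_spec(name) is not None:
            return name
    return "html.parser"


def make_soup(markup):
    """Parse markup with an explicitly chosen parser."""
    # Imported here so that pick_parser() can be used even without bs4 installed.
    from bs4 import BeautifulSoup
    return BeautifulSoup(markup, pick_parser())
```

Calling make_soup(data) then uses lxml when it is installed and falls back to the built-in parser otherwise. Printing pick_parser() shows which parser is actually in use, which helps when the output looks truncated.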

Katriel