I've done this:
>>> from bs4 import BeautifulSoup; import urllib2
>>> data = urllib2.urlopen('http://api.stackexchange.com/docs/').read()
>>> soup = BeautifulSoup(data.replace('""','"')) # there are errors on page
>>> soup.prettify()
<!DOCTYPE HTML>
<html lang="en">
............... # cut short
</html>
data seems to be alright and as expected. The problem is with the output of soup. soup.prettify() is not outputting a string containing everything in data. It seems like soup is not parsing (or whatever it does) the entire string.
If you inspect the source of the webpage and the output of soup.prettify(), you'll see that they don't match up.
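One rough way to make the mismatch concrete is to count opening tags in both strings and report the tags that occur less often in the parsed output. This is only a sketch: the helper names tag_counts and missing_tags are made up for illustration, and the regular expression is an approximation (it also counts tag-like text inside scripts or comments).

```python
import re
from collections import Counter

# Matches the name of an opening tag such as <div ...>; closing tags
# (</div>) and declarations (<!DOCTYPE ...>) are deliberately skipped.
TAG_RE = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)")


def tag_counts(markup):
    """Count opening tags by lower-cased name (hypothetical helper)."""
    return Counter(name.lower() for name in TAG_RE.findall(markup))


def missing_tags(original, parsed):
    """Return tags that occur more often in `original` than in `parsed`.

    Counter subtraction keeps only positive differences, so tags the
    parser added (for example an implied <tbody>) are ignored.
    """
    return tag_counts(original) - tag_counts(parsed)


# Intended use with the names from the session above:
#     missing_tags(data, soup.prettify())
```

A non-empty result would show which tags were dropped and how many of each, which is easier to reason about than comparing the two strings by eye.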
What is happening here and why?
I have a feeling I'm not being very clear in this post. If so, please comment and I'll try to elaborate. Otherwise, feel free to remove this sentence.
Update
In reply to a comment by FakeRainBrigand: the problem persists even when I save the HTML from the browser and read it from disk. So even this has the same problem:
data = open('Stack Exchange API.htm').read()