
Here is a simple piece of code in Python 2.7.2 which fetches a site and extracts all links from it:

import urllib2
from bs4 import BeautifulSoup

def getAllLinks(url):
    response = urllib2.urlopen(url)
    content = response.read()
    soup = BeautifulSoup(content, "html5lib")
    return soup.find_all("a")

links1 = getAllLinks('http://www.stanford.edu')
links2 = getAllLinks('http://med.stanford.edu/')

print len(links1)
print len(links2)

The problem is that it doesn't work in the second scenario. It prints 102 and 0, while there are clearly links on the second site. BeautifulSoup doesn't throw parsing errors and it pretty-prints the markup fine. I suspect it may be caused by the first line of the source of med.stanford.edu, which says that it's XML (even though the Content-Type is text/html):

<?xml version="1.0" encoding="iso-8859-1"?>

I can't figure out how to set up BeautifulSoup to disregard it, or how to work around it. I'm using html5lib as the parser because I had problems with the default one (it parsed the markup incorrectly).

slawek

2 Answers


When a document claims to be XML, I find the lxml parser gives the best results. Trying your code but using the lxml parser instead of html5lib finds the 300 links.
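For illustration, the only change needed in the code from the question is the parser argument (a minimal sketch, assuming the lxml package is installed):

import urllib2
from bs4 import BeautifulSoup

def getAllLinks(url):
    response = urllib2.urlopen(url)
    content = response.read()
    # "lxml" selects lxml's HTML parser, which copes with the leading
    # <?xml ...?> declaration instead of returning an empty result
    soup = BeautifulSoup(content, "lxml")
    return soup.find_all("a")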

Leonard Richardson
  • Indeed, the lxml parser fixes the problem. Before picking html5lib, I tried to install lxml, but I had problems installing it on Windows, exactly as in this question http://stackoverflow.com/q/1904752/808271, so I decided to go with html5lib as it installed without problems. After your answer I decided to give it a try once more, and I managed to install lxml using compiled binaries for Python 2.7 from this answer http://stackoverflow.com/a/9056298/808271. Thank you, I'm going to change the parser to lxml, as it not only solves the problem but is also recommended in the documentation as the faster one. – slawek Apr 23 '12 at 20:50

You are precisely right that the problem is the <?xml... line. Disregarding it is very simple: just skip the first line of content, by replacing

    content = response.read()

with something like

    content = "\n".join(response.readlines()[1:])

Upon this change, len(links2) becomes 300.

ETA: You probably want to do this conditionally, so you don't always skip the first line of content. An example would be something like:

content = response.read()
if content.startswith("<?xml"):
    content = "\n".join(content.split("\n")[1:])
David Robinson
  • Great that you narrowed it down so easily; I was afraid encoding could also be a cause. I had hoped the library provided some way to do this more safely. For example, there could be a site where this line is the second line instead of the first, or there could be whitespace at the beginning of the line. A regular expression would probably be enough, but there are many edge cases. The whole point of a parsing library is to abstract away from that, but I guess the web is never that simple and straightforward. What's important is that it solves the problem, thanks. – slawek Apr 22 '12 at 18:23
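A rough sketch of such a regular-expression based workaround, which would also tolerate whitespace before the declaration, might look something like this (an illustration only, using the same libraries as above):

import re
import urllib2
from bs4 import BeautifulSoup

def getAllLinks(url):
    content = urllib2.urlopen(url).read()
    # Remove an <?xml ...?> declaration at the start of the document,
    # even if it is preceded by whitespace
    content = re.sub(r'^\s*<\?xml[^>]*\?>\s*', '', content)
    soup = BeautifulSoup(content, "html5lib")
    return soup.find_all("a")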