I’m using the code below (found here: retrieve links from web page using python and BeautifulSoup) to extract all the links from a website.
import httplib2
from bs4 import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('http://www.bestwestern.com.au')
for link in BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        print(link['href'])
I’m using the site http://www.bestwestern.com.au as a test. Unfortunately, I noticed that the code does not extract some links, for example http://www.bestwestern.com.au/about-us/careers/ , and I don’t know why. This is what I found in the page source:
<li><a href="http://www.bestwestern.com.au/about-us/careers/">Careers</a></li>
I think the extractor should normally pick it up. The BeautifulSoup documentation says: “The most common type of unexpected behavior is that you can’t find a tag that you know is in the document. You saw it going in, but find_all() returns [] or find() returns None. This is another common problem with Python’s built-in HTML parser, which sometimes skips tags it doesn’t understand. Again, the solution is to install lxml or html5lib.” So I installed html5lib, but I still get the same behavior.
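One thing I’m not sure about: installing html5lib may not be enough by itself, because BeautifulSoup only uses it when it is the best parser available (lxml takes priority if both are installed) or when it is named explicitly. A minimal sketch of naming it explicitly, tested against the markup quoted from the page source above (whether this fixes the live page is an assumption on my part):

```python
from bs4 import BeautifulSoup

# Markup copied from the page source quoted above.
html = '<li><a href="http://www.bestwestern.com.au/about-us/careers/">Careers</a></li>'

# Name the parser explicitly -- merely installing html5lib does not make
# BeautifulSoup use it if lxml is also installed.
soup = BeautifulSoup(html, 'html5lib')
links = [a['href'] for a in soup.find_all('a', href=True)]
print(links)
```

If the link still does not show up when parsing the full response, it may be worth checking whether the href string occurs anywhere in the raw response body; if it does not, the anchor is created by JavaScript after page load and no HTML parser will find it.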
Thank you for your help