I’m using the code found here (retrieve links from web page using python and BeautifulSoup) to extract all links from a website:

import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('http://www.bestwestern.com.au')

for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
    if link.has_attr('href'):
        print link['href']

I’m using the site http://www.bestwestern.com.au as a test. Unfortunately, I noticed that the code fails to extract some links, for example this one: http://www.bestwestern.com.au/about-us/careers/. I don’t know why. In the page source, this is what I found:

<li><a href="http://www.bestwestern.com.au/about-us/careers/">Careers</a></li>

I think the extractor should identify it. In the BeautifulSoup documentation I read: “The most common type of unexpected behavior is that you can’t find a tag that you know is in the document. You saw it going in, but find_all() returns [] or find() returns None. This is another common problem with Python’s built-in HTML parser, which sometimes skips tags it doesn’t understand. Again, the solution is to install lxml or html5lib.” So I installed html5lib, but I still get the same behavior.
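
As far as I can tell, installing html5lib by itself changes nothing here: BeautifulSoup 3 always uses its own built-in parser, and only BeautifulSoup 4 (bs4) can be pointed at html5lib, and even then it is safest to name the parser explicitly. A minimal sketch of what I mean, assuming requests and bs4 are installed:

import requests
from bs4 import BeautifulSoup  # BeautifulSoup 4, not the old BeautifulSoup module

response = requests.get('http://www.bestwestern.com.au')

# ask for html5lib explicitly rather than relying on bs4's default choice
soup = BeautifulSoup(response.content, 'html5lib')
for link in soup.find_all('a', href=True):
    print(link['href'])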

Thank you for your help

– BND
  • I don't actually see a "Careers" link on this page - are we looking at the same page? – alecxe Sep 19 '16 at 22:17
  • You'll see the "careers" link by looking at the sitemap here: http://www.bestwestern.com.au/sitemap/ – BND Sep 20 '16 at 11:48

2 Answers


OK, so this is an old question, but I stumbled upon it in my search and it seems like it should be relatively simple to accomplish. I switched from httplib2 to requests.

import requests
from bs4 import BeautifulSoup, SoupStrainer

baseurl = 'http://www.bestwestern.com.au'
seen_urls = set()

def get_links(url):
    response = requests.get(url)
    # the strainer keeps only <a> tags that actually carry an href
    for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a', href=True)):
        href = link['href']
        print(href)
        # mark a link as seen before recursing, and follow only internal links
        if baseurl in href and href not in seen_urls:
            seen_urls.add(href)
            get_links(href)

if __name__ == '__main__':
    get_links(baseurl)
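
One caveat: deep recursion like this can hit Python's default recursion limit on a large site. An iterative breadth-first variant avoids that; a rough sketch under the same assumptions (requests and bs4 installed), with a hypothetical crawl() helper:

import requests
from bs4 import BeautifulSoup, SoupStrainer
from collections import deque

def crawl(baseurl):
    # breadth-first traversal: a queue instead of recursion, so a large
    # site cannot overflow Python's recursion limit
    seen = {baseurl}
    queue = deque([baseurl])
    while queue:
        url = queue.popleft()
        response = requests.get(url)
        for link in BeautifulSoup(response.content, 'html.parser',
                                  parse_only=SoupStrainer('a', href=True)):
            href = link['href']
            print(href)
            # follow only internal links we have not queued before
            if baseurl in href and href not in seen:
                seen.add(href)
                queue.append(href)

crawl('http://www.bestwestern.com.au')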

– StoneyD

One problem is that you are using BeautifulSoup version 3, which is no longer maintained. You need to upgrade to BeautifulSoup version 4:

pip install beautifulsoup4

Another problem is that there is no "careers" link on the main page, but there is one on the "sitemap" page - request that page and parse it with the default html.parser parser, and you'll see the "careers" link printed among others:

import requests
from bs4 import BeautifulSoup, SoupStrainer

response = requests.get('http://www.bestwestern.com.au/sitemap/')

for link in BeautifulSoup(response.content, "html.parser", parse_only=SoupStrainer('a', href=True)):
    print(link['href'])

Note how I've moved the "has to have href" rule to the soup strainer.
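
For comparison, the same filter can be expressed without a strainer, as an href=True rule in find_all; a sketch that parses the whole document first:

import requests
from bs4 import BeautifulSoup

response = requests.get('http://www.bestwestern.com.au/sitemap/')

# parse everything, then filter: the same output as the strained version,
# just without skipping non-<a> markup during parsing
soup = BeautifulSoup(response.content, "html.parser")
for link in soup.find_all('a', href=True):
    print(link['href'])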

– alecxe
  • I have version 4 of BeautifulSoup but still can't find the link. I don't know if the default parser is Python’s built-in HTML parser, but I think the problem may come from that side. – BND Sep 20 '16 at 12:03
  • “This is another common problem with Python’s built-in HTML parser, which sometimes skips tags it doesn’t understand. Again, the solution is to install lxml or html5lib.” So I installed html5lib. But I still have the same behavior. – BND Sep 20 '16 at 12:06
  • @BND No, no - as I asked: there is no "careers" link on the main page, but there is one on the `sitemap` page. I've updated the code in the answer - it works for me as-is and prints the "careers" link as well. – alecxe Sep 20 '16 at 12:17
  • Thank you for your help. It works for me too. But I don't really understand: why can't the link be found from the main page http://www.bestwestern.com.au? – BND Sep 20 '16 at 14:40
  • Thank you for your help. It works for me too. But I understand now: the code only extracts the links on the given page, not on the whole website? I'm looking for something that does the latter. – BND Sep 20 '16 at 14:57