I’m using the code found here (retrieve links from web page using python and BeautifulSoup) to extract all links from a website:

import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('http://www.bestwestern.com.au')

for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
    if link.has_attr('href'):
        print link['href']

I’m using the site http://www.bestwestern.com.au as a test. Unfortunately, I noticed that the code fails to extract some links, for example this one: http://www.bestwestern.com.au/about-us/careers/. I don’t know why. In the page source, this is what I found:

<li><a href="http://www.bestwestern.com.au/about-us/careers/">Careers</a></li>

I think the extractor should identify it. In the BeautifulSoup documentation I read: “The most common type of unexpected behavior is that you can’t find a tag that you know is in the document. You saw it going in, but find_all() returns [] or find() returns None. This is another common problem with Python’s built-in HTML parser, which sometimes skips tags it doesn’t understand. Again, the solution is to install lxml or html5lib.” So I installed html5lib, but I still get the same behavior.
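
As far as I can tell, installing html5lib by itself changes nothing here: BeautifulSoup 3 always uses its own built-in parser, and only BeautifulSoup 4 (bs4) can be pointed at html5lib, and even then it is safest to name the parser explicitly. A minimal sketch of what I mean, assuming requests and bs4 are installed:

import requests
from bs4 import BeautifulSoup  # BeautifulSoup 4, not the old BeautifulSoup module

response = requests.get('http://www.bestwestern.com.au')

# ask for html5lib explicitly rather than relying on bs4's default choice
soup = BeautifulSoup(response.content, 'html5lib')
for link in soup.find_all('a', href=True):
    print(link['href'])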

Thank you for your help

– BND
  • I don't actually see a "Careers" link on this page - are we looking at the same page? – alecxe Sep 19 '16 at 22:17
  • You'll see the "careers" link by looking at the sitemap here: http://www.bestwestern.com.au/sitemap/ – BND Sep 20 '16 at 11:48

2 Answers


OK, so this is an old question, but I stumbled upon it in my search and it seems like it should be relatively simple to accomplish. I switched from httplib2 to requests.

import requests
from bs4 import BeautifulSoup, SoupStrainer

baseurl = 'http://www.bestwestern.com.au'
seen_urls = set()

def get_links(url):
    response = requests.get(url)
    # the strainer keeps only <a> tags that actually carry an href
    for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a', href=True)):
        href = link['href']
        print(href)
        # mark a link as seen before recursing, and follow only internal links
        if baseurl in href and href not in seen_urls:
            seen_urls.add(href)
            get_links(href)

if __name__ == '__main__':
    get_links(baseurl)
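
One caveat: deep recursion like this can hit Python's default recursion limit on a large site. An iterative breadth-first variant avoids that; a rough sketch under the same assumptions (requests and bs4 installed), with a hypothetical crawl() helper:

import requests
from bs4 import BeautifulSoup, SoupStrainer
from collections import deque

def crawl(baseurl):
    # breadth-first traversal: a queue instead of recursion, so a large
    # site cannot overflow Python's recursion limit
    seen = {baseurl}
    queue = deque([baseurl])
    while queue:
        url = queue.popleft()
        response = requests.get(url)
        for link in BeautifulSoup(response.content, 'html.parser',
                                  parse_only=SoupStrainer('a', href=True)):
            href = link['href']
            print(href)
            # follow only internal links we have not queued before
            if baseurl in href and href not in seen:
                seen.add(href)
                queue.append(href)

crawl('http://www.bestwestern.com.au')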

– StoneyD

One problem is that you are using BeautifulSoup version 3, which is no longer maintained. You need to upgrade to BeautifulSoup version 4:

pip install beautifulsoup4

Another problem is that there is no "careers" link on the main page, but there is one on the "sitemap" page - request that page and parse it with the default html.parser parser, and you'll see the "careers" link printed among others:

import requests
from bs4 import BeautifulSoup, SoupStrainer

response = requests.get('http://www.bestwestern.com.au/sitemap/')

for link in BeautifulSoup(response.content, "html.parser", parse_only=SoupStrainer('a', href=True)):
    print(link['href'])

Note how I've moved the "has to have href" rule to the soup strainer.
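
For comparison, the same filter can be expressed without a strainer, as an href=True rule in find_all; a sketch that parses the whole document first:

import requests
from bs4 import BeautifulSoup

response = requests.get('http://www.bestwestern.com.au/sitemap/')

# parse everything, then filter: the same output as the strained version,
# just without skipping non-<a> markup during parsing
soup = BeautifulSoup(response.content, "html.parser")
for link in soup.find_all('a', href=True):
    print(link['href'])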

– alecxe
  • I have version 4 of BeautifulSoup but still can't find the link. I don't know if the default parser is Python’s built-in HTML parser, but I think the problem may come from that side. – BND Sep 20 '16 at 12:03
  • “This is another common problem with Python’s built-in HTML parser, which sometimes skips tags it doesn’t understand. Again, the solution is to install lxml or html5lib.” So I installed html5lib. But I still have the same behavior. – BND Sep 20 '16 at 12:06
  • @BND No, no - as I asked: there is no "careers" link on the main page, but there is one on the `sitemap` page. I've updated the code in the answer - it works for me as-is and prints the "careers" link as well. – alecxe Sep 20 '16 at 12:17
  • Thank you for your help. It works for me too. But I don't really understand: why can't the link be found from the main page http://www.bestwestern.com.au? – BND Sep 20 '16 at 14:40
  • Thank you for your help. It works for me too. But I understand now: the code only extracts the links on the given page, not on the whole website? I'm looking for something that does the latter. – BND Sep 20 '16 at 14:57