
I'm new to Python, and extremely impressed by the number of libraries at my disposal. I already have a function that uses Beautiful Soup to extract URLs from a site, but not all of them are relevant: I only want webpages (no media) on the same website (the domain or a subdomain, but no other domains). I've been manually programming around examples as I run into them, but I feel like I'm reinventing the wheel - surely this is a common problem in internet applications.

Here's an example list of URLs that I might retrieve from a website, say http://example.com, with markings for whether or not I want them and why. Hopefully this illustrates the issue.

Good:

  • example.com/page - it links to another page on the same domain
  • example.com/page.html - has a filetype ending but it's an HTML page
  • subdomain.example.com/page.html - it's on the same site, though on a subdomain
  • /about/us - it's a relative link, so it doesn't have the domain in it, but it's implied

Bad:

  • otherexample.com/page - bad, the domain doesn't match
  • example.com/image.jpg - bad, it's an image and not a page
  • / - bad - sometimes there's just a slash in the "a" tag, but that's a reference to the page I'm already on
  • #anchor - this is also a relative link, but it's on the same page, so there's no need for it

I've been writing cases in if statements for each of these...but there has to be a better way!
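Roughly, my hand-rolled checks look like the sketch below, using the stdlib urlparse module (urllib.parse in Python 3) - the is_wanted_page name and the extension set are just my own guesses at what counts as a "page":

from urlparse import urljoin, urlparse  # urllib.parse in Python 3
import os

BASE = "http://example.com"
PAGE_EXTENSIONS = {"", ".html", ".htm", ".php"}  # guessing at what counts as a "page"

def is_wanted_page(href):
    # same-page anchors and empty hrefs are never wanted
    if not href or href.startswith("#"):
        return False
    # resolve relative links like /about/us against the site root
    parts = urlparse(urljoin(BASE, href))
    base_host = urlparse(BASE).netloc
    # keep the domain itself and its subdomains, drop other domains
    if parts.netloc != base_host and not parts.netloc.endswith("." + base_host):
        return False
    # a bare "/" is just a reference to the page I'm already on
    if parts.path in ("", "/"):
        return False
    # media like image.jpg has a non-page extension
    return os.path.splitext(parts.path)[1].lower() in PAGE_EXTENSIONS

It handles the examples above, but every new site seems to need another special case.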


Edit: Here's my current code, which returns nothing:

import urllib2
from bs4 import BeautifulSoup

ignore_values = {"", "/"}

def desired_links(href):
    # ignore if href is not set
    if not href:
        return False

    # ignore if it is just a link to the same page
    if href.startswith("#"):
        return False

    # skip ignored values
    if href in ignore_values:
        return False

def explorePage(pageURL):
    # Get web page
    opener = urllib2.build_opener()
    opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
    response = opener.open(pageURL)
    html = response.read()

    # Parse web page for links
    soup = BeautifulSoup(html, 'html.parser')
    links = [a["href"] for a in soup.find_all("a", href=desired_links)]
    for link in links:
        print(link)


def main():
    explorePage("http://xkcd.com")
Jake
  • You just have to create some rules and apply them to each href – Padraic Cunningham Oct 09 '16 at 18:53
  • That's what I've been doing - ignore if it's just a slash...remove the http:// if it exists and make sure it says example.com before the first slash, otherwise ignore...remove all # and anything that follows, if it's now empty then ignore...is writing them out manually the only way to do it? No libraries out there that could help? – Jake Oct 09 '16 at 18:54
  • A big step to filtering would be `soup.select("a[href*=example.com]")` – Padraic Cunningham Oct 09 '16 at 18:54
  • Oooh, now that's a really nice one. Thank you!! – Jake Oct 09 '16 at 18:55
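(For illustration, that attribute-contains selector on a tiny invented snippet - note the value needs quoting under stricter CSS parsers:)

from bs4 import BeautifulSoup

html = '<a href="http://example.com/page">in</a><a href="http://other.com/x">out</a>'
soup = BeautifulSoup(html, "html.parser")

# [href*="example.com"] matches any <a> whose href contains that substring
print([a["href"] for a in soup.select('a[href*="example.com"]')])
# -> ['http://example.com/page']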

1 Answer


BeautifulSoup is quite flexible about letting you create and apply rules to attribute values. You can create a filtering function and pass it as the value of the href argument to find_all().

For example, something for you to start with:

ignore_values = {"", "/"}
def desired_links(href):
    # ignore if href is not set
    if not href:
        return False

    # ignore if it is just a link to the same page
    if href.startswith("#"):
        return False

    # skip ignored values
    if href in ignore_values:
        return False

    # TODO: more rules
    # you would probably need the stdlib "urlparse" module for a proper URL analysis

    return True

Usage:

links = [a["href"] for a in soup.find_all("a", href=desired_links)]

You should take a look at Scrapy and its Link Extractors.
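For example, a minimal sketch (the sample markup and domain are invented; LinkExtractor's default deny_extensions already filters out common media files such as .jpg):

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

html = b'<a href="/about/us">us</a> <a href="http://otherexample.com/page">other</a>'
response = HtmlResponse(url="http://example.com/", body=html, encoding="utf-8")

# allow_domains accepts the domain itself and its subdomains;
# relative hrefs are resolved against the response URL automatically
extractor = LinkExtractor(allow_domains=["example.com"])
for link in extractor.extract_links(response):
    print(link.url)  # -> http://example.com/about/us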

alecxe
  • This is awesome. Man, Python comes with the coolest things. I have some questions, though - first, there's an "if '#' in href", but sometimes there are links to other pages on the site (valid) that just happen to have anchor tags in them. So I only want to ignore the anchor tags alone, so I guess just strip them? It appears that this would ignore any URL with a # in it, is that right? – Jake Oct 09 '16 at 18:59
  • @Jake yeah, good catch, this part has to be improved to take into account the current page only. – alecxe Oct 09 '16 at 19:01
  • `if not href.startswith("#")` would probably work, or maybe `if not href.startswith("#") or not href.endswith(("#", ".png", ".jpg"))`; you can pass a tuple of args to both endswith and startswith – Padraic Cunningham Oct 09 '16 at 19:02
  • I just updated my question with some example code, heavily based on what's here - unfortunately, it returns nothing - any idea why? – Jake Oct 09 '16 at 19:16
  • @Jake sure, do you have the `return True` in the end of the function? – alecxe Oct 09 '16 at 19:17
  • @alecxe Ah there we go, I see now that you just added that in - thank you very much!! It now works perfectly as expected. One last thing - can it be combined with Padriac's selector, which limits it to only links on the same domain? – Jake Oct 09 '16 at 19:39
  • @Jake no, you cannot combine the `select()` and `find_all()` in a single call, but you can check the domain in a more reliable way using `urlparse()` - see http://stackoverflow.com/questions/9626535/get-domain-name-from-url. Hope that helps. – alecxe Oct 10 '16 at 03:34
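A sketch of what that urlparse()-based domain check could look like combined with the filtering function - the make_link_filter closure is not from the thread, just one way to bind the base URL:

from urlparse import urljoin, urlparse  # urllib.parse in Python 3

def make_link_filter(base_url):
    """Build an href filter bound to one site, usable as find_all(href=...)."""
    base_host = urlparse(base_url).netloc

    def link_filter(href):
        if not href or href.startswith("#") or href == "/":
            return False
        # resolve relative links, then compare hosts
        host = urlparse(urljoin(base_url, href)).netloc
        # keep the domain itself and any of its subdomains
        return host == base_host or host.endswith("." + base_host)

    return link_filter

# usage, in place of the fixed desired_links function:
# links = [a["href"] for a in soup.find_all("a", href=make_link_filter("http://xkcd.com"))]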