
I'm new to Python, and extremely impressed by the number of libraries at my disposal. I already have a function that uses Beautiful Soup to extract URLs from a site, but not all of them are relevant: I only want webpages (no media) on the same website (the domain or a subdomain, but no other domains). I've been manually programming around examples as I run into them, but I feel like I'm reinventing the wheel - surely this is a common problem in internet applications.

Here's an example list of URLs that I might retrieve from a website, say http://example.com, with markings for whether or not I want them and why. Hopefully this illustrates the issue.

Good:

  • example.com/page - it links to another page on the same domain
  • example.com/page.html - has a filetype ending but it's an HTML page
  • subdomain.example.com/page.html - it's on the same site, though on a subdomain
  • /about/us - it's a relative link, so it doesn't have the domain in it, but it's implied

Bad:

  • otherexample.com/page - bad, the domain doesn't match
  • example.com/image.jpg - bad, it's an image and not a page
  • / - bad - sometimes there's just a slash in the "a" tag, but that's a reference to the page I'm already on
  • #anchor - this is also a relative link, but it's on the same page, so there's no need for it

I've been writing cases in if statements for each of these...but there has to be a better way!
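Roughly, my hand-rolled checks look like the sketch below, using the stdlib urlparse module (urllib.parse in Python 3) - the is_wanted_page name and the extension set are just my own guesses at what counts as a "page":

from urlparse import urljoin, urlparse  # urllib.parse in Python 3
import os

BASE = "http://example.com"
PAGE_EXTENSIONS = {"", ".html", ".htm", ".php"}  # guessing at what counts as a "page"

def is_wanted_page(href):
    # same-page anchors and empty hrefs are never wanted
    if not href or href.startswith("#"):
        return False
    # resolve relative links like /about/us against the site root
    parts = urlparse(urljoin(BASE, href))
    base_host = urlparse(BASE).netloc
    # keep the domain itself and its subdomains, drop other domains
    if parts.netloc != base_host and not parts.netloc.endswith("." + base_host):
        return False
    # a bare "/" is just a reference to the page I'm already on
    if parts.path in ("", "/"):
        return False
    # media like image.jpg has a non-page extension
    return os.path.splitext(parts.path)[1].lower() in PAGE_EXTENSIONS

It handles the examples above, but every new site seems to need another special case.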


Edit: Here's my current code, which returns nothing:

import urllib2
from bs4 import BeautifulSoup

ignore_values = {"", "/"}

def desired_links(href):
    # ignore if href is not set
    if not href:
        return False

    # ignore if it is just a link to the same page
    if href.startswith("#"):
        return False

    # skip ignored values
    if href in ignore_values:
        return False

def explorePage(pageURL):
    # Get web page
    opener = urllib2.build_opener()
    opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
    response = opener.open(pageURL)
    html = response.read()

    # Parse web page for links
    soup = BeautifulSoup(html, 'html.parser')
    links = [a["href"] for a in soup.find_all("a", href=desired_links)]
    for link in links:
        print(link)


def main():
    explorePage("http://xkcd.com")
Jake
  • You just have to create some rules and apply them to each href – Padraic Cunningham Oct 09 '16 at 18:53
  • That's what I've been doing - ignore if it's just a slash...remove the http:// if it exists and make sure it says example.com before the first slash, otherwise ignore...remove all # and anything that follows, if it's now empty then ignore...is writing them out manually the only way to do it? No libraries out there that could help? – Jake Oct 09 '16 at 18:54
  • A big step to filtering would be `soup.select("a[href*=example.com]")` – Padraic Cunningham Oct 09 '16 at 18:54
  • Oooh, now that's a really nice one. Thank you!! – Jake Oct 09 '16 at 18:55
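(For illustration, that attribute-contains selector on a tiny invented snippet - note the value needs quoting under stricter CSS parsers:)

from bs4 import BeautifulSoup

html = '<a href="http://example.com/page">in</a><a href="http://other.com/x">out</a>'
soup = BeautifulSoup(html, "html.parser")

# [href*="example.com"] matches any <a> whose href contains that substring
print([a["href"] for a in soup.select('a[href*="example.com"]')])
# -> ['http://example.com/page']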

1 Answer


BeautifulSoup is quite flexible about letting you create and apply rules to attribute values. You can create a filtering function and pass it as the value of the href argument to find_all().

For example, something for you to start with:

ignore_values = {"", "/"}
def desired_links(href):
    # ignore if href is not set
    if not href:
        return False

    # ignore if it is just a link to the same page
    if href.startswith("#"):
        return False

    # skip ignored values
    if href in ignore_values:
        return False

    # TODO: more rules
    # you would probably need the stdlib "urlparse" module for a proper URL analysis

    return True

Usage:

links = [a["href"] for a in soup.find_all("a", href=desired_links)]

You should take a look at Scrapy and its Link Extractors.
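For example, a minimal sketch (the sample markup and domain are invented; LinkExtractor's default deny_extensions already filters out common media files such as .jpg):

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

html = b'<a href="/about/us">us</a> <a href="http://otherexample.com/page">other</a>'
response = HtmlResponse(url="http://example.com/", body=html, encoding="utf-8")

# allow_domains accepts the domain itself and its subdomains;
# relative hrefs are resolved against the response URL automatically
extractor = LinkExtractor(allow_domains=["example.com"])
for link in extractor.extract_links(response):
    print(link.url)  # -> http://example.com/about/us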

alecxe
  • This is awesome. Man, Python comes with the coolest things. I have some questions, though - first, there's an "if '#' in href", but sometimes there are links to other pages on the site (valid) that just happen to have anchor tags in them. So I only want to ignore the anchor tags alone, so I guess just strip them? It appears that this would ignore any URL with a # in it, is that right? – Jake Oct 09 '16 at 18:59
  • @Jake yeah, good catch, this part has to be improved to take into account the current page only. – alecxe Oct 09 '16 at 19:01
  • `if not href.startswith("#")` would probably work, or maybe `if not href.startswith("#") or not href.endswith(("#", ".png", ".jpg"))`; you can pass a tuple of args to both endswith and startswith – Padraic Cunningham Oct 09 '16 at 19:02
  • I just updated my question with some example code, heavily based on what's here - unfortunately, it returns nothing - any idea why? – Jake Oct 09 '16 at 19:16
  • @Jake sure, do you have the `return True` in the end of the function? – alecxe Oct 09 '16 at 19:17
  • @alecxe Ah there we go, I see now that you just added that in - thank you very much!! It now works perfectly as expected. One last thing - can it be combined with Padriac's selector, which limits it to only links on the same domain? – Jake Oct 09 '16 at 19:39
  • @Jake no, you cannot combine the `select()` and `find_all()` in a single call, but you can check the domain in a more reliable way using `urlparse()` - see http://stackoverflow.com/questions/9626535/get-domain-name-from-url. Hope that helps. – alecxe Oct 10 '16 at 03:34
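A sketch of what that urlparse()-based domain check could look like combined with the filtering function - the make_link_filter closure is not from the thread, just one way to bind the base URL:

from urlparse import urljoin, urlparse  # urllib.parse in Python 3

def make_link_filter(base_url):
    """Build an href filter bound to one site, usable as find_all(href=...)."""
    base_host = urlparse(base_url).netloc

    def link_filter(href):
        if not href or href.startswith("#") or href == "/":
            return False
        # resolve relative links, then compare hosts
        host = urlparse(urljoin(base_url, href)).netloc
        # keep the domain itself and any of its subdomains
        return host == base_host or host.endswith("." + base_host)

    return link_filter

# usage, in place of the fixed desired_links function:
# links = [a["href"] for a in soup.find_all("a", href=make_link_filter("http://xkcd.com"))]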