I'm new to Python, and extremely impressed by the number of libraries at my disposal. I already have a function that uses Beautiful Soup to extract URLs from a site, but not all of them are relevant. I only want webpages (no media) on the same website (domain or subdomain, but no other domains). I'm trying to manually program around examples I run into, but I feel like I'm reinventing the wheel - surely this is a common problem in internet applications.
Here's an example list of URLs that I might retrieve from a website, say http://example.com, with markings for whether or not I want them and why. Hopefully this illustrates the issue.
Good:
- example.com/page - it links to another page on the same domain
- example.com/page.html - has a filetype ending but it's an HTML page
- subdomain.example.com/page.html - it's on the same site, though on a subdomain
- /about/us - it's a relative link, so it doesn't have the domain in it, but it's implied
Bad:
- otherexample.com/page - bad, the domain doesn't match
- example.com/image.jpg - bad, it's an image and not a page
- / - bad - sometimes there's just a slash in the "a" tag, but that's a reference to the page I'm already on
- #anchor - this is also a relative link, but it's on the same page, so there's no need for it
I've been writing cases in if statements for each of these... but there has to be a better way!
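For reference, the good/bad rules listed above can be sketched with the standard library's `urllib.parse` rather than hand-written string cases. This is only a rough sketch under some assumptions: it uses Python 3 (in Python 2 the same functions live in the `urlparse` module), the name `is_wanted_link` and the `MEDIA_EXTENSIONS` set are made up for illustration, and hrefs are assumed to carry a scheme when they are absolute (an href like `example.com/page` without `http://` would be resolved as a relative path). It also compares against the exact hostname of the current page, so starting from `www.example.com` would not match `sub.example.com`.

```python
import posixpath
from urllib.parse import urldefrag, urljoin, urlparse

# Hypothetical deny-list of non-page file extensions; extend as needed.
MEDIA_EXTENSIONS = {
    ".jpg", ".jpeg", ".png", ".gif", ".svg", ".pdf",
    ".mp3", ".mp4", ".zip", ".css", ".js",
}


def is_wanted_link(href, page_url):
    """Return True if href looks like another page on the same site as page_url."""
    if not href:
        return False

    # Resolve relative links ("/about/us", "/", "#anchor") against the current
    # page and drop the fragment, so "#anchor" resolves to the page itself.
    absolute, _fragment = urldefrag(urljoin(page_url, href))
    target = urlparse(absolute)
    current = urlparse(urldefrag(page_url)[0])

    # Skip mailto:, javascript:, ftp:, and similar schemes.
    if target.scheme not in ("http", "https"):
        return False

    # Accept the same host or any subdomain of it. The leading dot keeps
    # "otherexample.com" from matching "example.com".
    site = current.hostname or ""
    host = target.hostname or ""
    if host != site and not host.endswith("." + site):
        return False

    # Skip links that point back at the page we are already on.
    same_path = (target.path or "/") == (current.path or "/")
    if host == site and same_path and target.query == current.query:
        return False

    # Skip media and other non-page files based on the path's extension.
    extension = posixpath.splitext(target.path)[1].lower()
    return extension not in MEDIA_EXTENSIONS
```

Usage would look something like `[h for h in hrefs if is_wanted_link(h, "http://example.com")]`. Checking the extension is only a heuristic; a stricter check would look at the `Content-Type` header of the response.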
Edit: Here's my current code, which returns nothing:
import urllib2

from bs4 import BeautifulSoup

ignore_values = {"", "/"}


def desired_links(href):
    # ignore if href is not set
    if not href:
        return False
    # ignore if it is just a link to the same page
    if href.startswith("#"):
        return False
    # skip ignored values
    if href in ignore_values:
        return False


def explorePage(pageURL):
    # Get web page
    opener = urllib2.build_opener()
    opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
    response = opener.open(pageURL)
    html = response.read()
    # Parse web page for links
    soup = BeautifulSoup(html, 'html.parser')
    links = [a["href"] for a in soup.find_all("a", href=desired_links)]
    for link in links:
        print(link)
    return


def main():
    explorePage("http://xkcd.com")
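A likely reason nothing is printed: `desired_links` only ever returns `False`. When none of the `if` branches match, the function falls off the end and returns `None`, which Beautiful Soup treats as "no match", so `find_all` gives back an empty list. Separately, the snippet as shown defines `main()` but never calls it. A minimal sketch of the filter with the missing return added (same names as in the code above, nothing else changed):

```python
ignore_values = {"", "/"}


def desired_links(href):
    # ignore if href is not set
    if not href:
        return False
    # ignore if it is just a link to the same page
    if href.startswith("#"):
        return False
    # skip ignored values
    if href in ignore_values:
        return False
    # anything that survived the checks above is a link we want
    return True
```

With this version, `soup.find_all("a", href=desired_links)` should keep every `a` tag whose `href` passes the checks. It still does not handle the domain or media cases from the lists above.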