
I am a new Stack Overflow member so please let me know if and how I can improve this question. I am working on a Python script which will take a link to a website's home page, and then search for a specific URL throughout the entire website (not just that first homepage). The reason for this is that my research team would like to query a list of websites for a URL to a particular database, without having to click through every single page to find it. It is essentially a task of saying "Does this website reference this database? If so, how many times?" and then keeping that information for our records. So far, I have been able to use resources on SO and other pages to create a script that will scrape the HTML of the specific webpage I have referenced, and I have included this script for review.

import requests  
from bs4 import BeautifulSoup  

url = raw_input("Enter the name of the website you'd like me to check, followed by a space:")

r = requests.get(url)

soup = BeautifulSoup(r.content, features='lxml')

links = soup.find_all("a")
for link in links:
    if "http" and "dataone" in link.get("href"):
        print("<a href='%s'>%s</a>" %(link.get("href"), link.text))

As you can see, I am looking for a URL linking to a particular database (in this case, DataONE) after being given a website URL by the user. This script works great, but it only scrapes the particular page I point it at, NOT the entire website. So, if I provide the website https://www.lib.utk.edu/, it will search for references to DataONE within that page, but it will not search across all of the pages under the UTK Libraries website. (I do not have a high enough reputation on this site yet to post pictures, so I am unable to include an image of this script in action.)

I've heavily researched this on SO to try and gain insight, but none of the questions asked or answered thus far apply to my specific problem.

Examples:
1. How can I loop scraping data for multiple pages in a website using python and beautifulsoup4: in this particular question, the OP can find out how many pages they need to search through because their problem refers to a specific search made on a site. However, in my case, I will not know how many pages there are in each website.
2. Use BeautifulSoup to loop through and retrieve specific URLs: Again, this is dealing with parsing through URLs but it is not looking through an entire website for URLs.
3. How to loop through each page of website for web scraping with BeautifulSoup: The OP here seems to be struggling with the same problem I am having, but the accepted answer there does not provide enough detail for understanding HOW to approach a problem like this.

I've scoured the BeautifulSoup documentation but I have not found any help with web scraping an entire website from a single URL (and not knowing how many total pages are in the website). I've looked into using Scrapy, but I'm not sure it's what I need for my purposes on this project, because I am not trying to download or store data -- I am simply trying to see when and where a certain URL is referenced on an entire website.

My question: Is doing something like this possible with BeautifulSoup, and if so, can you suggest how I should change my current code to handle my research problem? Or is there another program I should look into using?

GeoMoon
  • Welcome to SO. Please boil down your question to a specific and at the same time short request without redundancy. That will make it much easier to help you. Thanks! – petezurich Oct 18 '18 at 13:49
  • Hi @petezurich, thank you for the comment. I'm a bit confused how to do that, as I've seen similar posts to this but were much shorter, and they received the comment of being too vague. Can you recommend posts that achieve this middle ground? – GeoMoon Oct 18 '18 at 13:54
  • I get your point. Your post still seems rather long. I simply suggest taking out all text that is not necessary to understand your problem. This will invite more people to read it and help you. – petezurich Oct 18 '18 at 14:18

2 Answers


You could use two Python sets to keep track of the pages you have already visited and the pages you still need to visit.

Also: your if condition is wrong. To test for both substrings, you cannot write a and b in c; you need a in c and b in c.
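
The reason is that in binds tighter than and, so "http" and "dataone" in href is evaluated as "http" and ("dataone" in href); the non-empty string "http" is always truthy, so only the second check matters. A quick illustration with a made-up href:

href = "/search?q=dataone"                      # made-up link with no "http" in it
print("http" and "dataone" in href)             # True, because only the right-hand test is evaluated
print("http" in href and "dataone" in href)     # False, both substrings are actually checked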

Something like this:

import requests
from bs4 import BeautifulSoup


baseurl = 'https://example.org'
urls_to_check = {baseurl, }   # pages still to visit
checked_urls = set()          # pages already visited

found_links = []
while urls_to_check:
    url = urls_to_check.pop()
    r = requests.get(url)

    soup = BeautifulSoup(r.content, features='lxml')

    links = soup.find_all("a")
    for link in links:
        # default to "" so anchors without an href don't raise a TypeError
        href = link.get("href", "")
        if "http" in href and "dataone" in href:
            found_links.append("<a href='%s'>%s</a>" % (href, link.text))
        elif href.startswith("/"):
            # internal link such as "/about": build the absolute URL and queue it
            if baseurl + href not in checked_urls:
                urls_to_check.add(baseurl + href)
    checked_urls.add(url)
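
Once the while loop finishes, found_links holds one entry per matching anchor, so the "how many times?" part of the question is simply the length of that list (variable names as in the snippet above):

print("Found %d references to DataONE" % len(found_links))
for entry in found_links:
    print(entry)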
MaxNoe
  • Sorry, I used the same name for what should have been two variables – MaxNoe Oct 18 '18 at 14:02
  • Now, in the end, found_links should contain all matches – MaxNoe Oct 18 '18 at 14:03
  • Your code assumes all internal links are of the form `href="/info.html"`. This might not hold everywhere! You'd have to check for all possible forms of internal links, e.g. `href="http://the-site.org/info.html"`, `href="https://the-site.org/info.html"`, `href="//the-site.org/info.html"` etc – Oliver Baumann Oct 18 '18 at 14:15
  • True, but this should be a working solution and easy to extend for more cases – MaxNoe Oct 18 '18 at 14:21
  • Thanks @MaxNoe! I am going to play around with your code for a bit as I was having problems getting it to run correctly (this is probably on my end since I am a newbie)-- since I'm still so new to coding, it takes me awhile to parse through code and understand each line. I appreciate you doing this and helping me out! – GeoMoon Oct 18 '18 at 18:30

You will need to implement some form of crawler.

This can be done manually; essentially, you'd do this:

  1. check if a robots.txt exists and parse it for URLs, adding them to a list to visit later (see the sketch after this list)
  2. parse the first page you visit for further links; you will probably search for all <a> elements and parse out their href, then figure out whether the link points to the same site, e.g. href="/info.html", but also href="http://lib.edu.org/info.html"
  3. add the identified URLs to a list of URLs to visit
  4. repeat from 2 until all URLs have been visited
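
For step 1, here is a minimal sketch of pulling page URLs out of a sitemap advertised in robots.txt (baseurl is a placeholder; not every site has a robots.txt, and not every robots.txt lists a Sitemap):

import requests
from bs4 import BeautifulSoup

baseurl = "https://example.org"   # placeholder for the site being checked

# robots.txt often contains one or more "Sitemap:" lines pointing at an XML
# sitemap, which enumerates the site's pages and saves a lot of blind crawling
robots = requests.get(baseurl + "/robots.txt")
sitemap_urls = [line.split(":", 1)[1].strip()
                for line in robots.text.splitlines()
                if line.lower().startswith("sitemap:")]

page_urls = []
for sitemap_url in sitemap_urls:
    sitemap = BeautifulSoup(requests.get(sitemap_url).content, features="xml")
    # every <loc> element in the sitemap is one page URL worth visiting
    page_urls.extend(loc.text for loc in sitemap.find_all("loc"))

print("%d pages listed in the sitemap(s)" % len(page_urls))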

I'd recommend looking into Scrapy, though. It lets you define Spiders that you feed with information about which URLs to start at and how to generate further links to visit. The Spider has a parse method that you can use to search for your database; in case of a match, you could update a local SQLite DB or simply write a count to a text file.
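
A bare-bones Spider along those lines could look like the sketch below (the class name, allowed_domains and start_urls are placeholders you would adapt; this assumes Scrapy is installed):

import scrapy

class DataoneSpider(scrapy.Spider):
    name = "dataone_check"
    allowed_domains = ["lib.utk.edu"]            # keeps the crawl on one site
    start_urls = ["https://www.lib.utk.edu/"]    # would come from the team's list of sites

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            if "dataone" in href.lower():
                # record every reference; Scrapy can write these out for you
                yield {"page": response.url, "link": href}
            elif href.startswith(("/", "http")):
                # follow further links; Scrapy resolves relative URLs and
                # skips duplicates and off-site requests automatically
                yield response.follow(href, callback=self.parse)

Running it with scrapy runspider dataone_spider.py -o matches.json writes one page/link pair per match, and the number of entries answers the "how many times?" question.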

TL;DR: from visiting a single page, it is hard to identify what other pages exist. You have to parse all internal links. A robots.txt can be helpful in this effort, but is not guaranteed to exist.

Oliver Baumann
  • Thank you @Oliver Baumann! This is very useful to know -- I will dig into the Scrapy documentation and check out parsing with a Spider. I appreciate your kind help! – GeoMoon Oct 18 '18 at 18:29