I am a new Stack Overflow member so please let me know if and how I can improve this question. I am working on a Python script which will take a link to a website's home page, and then search for a specific URL throughout the entire website (not just that first homepage). The reason for this is that my research team would like to query a list of websites for a URL to a particular database, without having to click through every single page to find it. It is essentially a task of saying "Does this website reference this database? If so, how many times?" and then keeping that information for our records. So far, I have been able to use resources on SO and other pages to create a script that will scrape the HTML of the specific webpage I have referenced, and I have included this script for review.
import requests
from bs4 import BeautifulSoup

url = raw_input("Enter the URL of the website you'd like me to check: ")
r = requests.get(url)
soup = BeautifulSoup(r.content, features='lxml')
links = soup.find_all("a")
for link in links:
    href = link.get("href")
    # link.get("href") can be None, and `"http" and "dataone" in href`
    # only tests the second substring, so check each condition explicitly
    if href and "http" in href and "dataone" in href:
        print("<a href='%s'>%s</a>" % (href, link.text))
As you can see, I am looking for a URL linking to a particular database (in this case, DataONE) after being given a website URL by the user. This script works great, but it only scrapes the particular page I link to -- NOT the entire website. So, if I provide the website https://www.lib.utk.edu/, it will only search for references to DataONE within that page; it will not search across all of the pages under the UTK Libraries website.
I've heavily researched this on SO to try and gain insight, but none of the questions asked or answered thus far apply to my specific problem.
Examples:
1. How can I loop scraping data for multiple pages in a website using python and beautifulsoup4: in this particular question, the OP can determine how many pages they need to search through, because their problem refers to a specific search made on a single site. In my case, however, I will not know in advance how many pages each website has.
2. Use BeautifulSoup to loop through and retrieve specific URLs: Again, this is dealing with parsing through URLs but it is not looking through an entire website for URLs.
3. How to loop through each page of website for web scraping with BeautifulSoup: The OP here seems to be struggling with the same problem I am having, but the accepted answer there does not provide enough detail for understanding HOW to approach a problem like this.
I've scoured the BeautifulSoup documentation but I have not found any help with web scraping an entire website from a single URL (and not knowing how many total pages are in the website). I've looked into using Scrapy, but I'm not sure it's what I need for my purposes on this project, because I am not trying to download or store data -- I am simply trying to see when and where a certain URL is referenced on an entire website.
My question: Is doing something like this possible with BeautifulSoup, and if so, can you suggest how I should change my current code to handle my research problem? Or is there another program I should look into using?
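For reference, here is a rough sketch of the breadth-first approach I was imagining (Python 3; the function names `crawl_for_target` and `is_internal` and the `max_pages` cap are my own placeholders, not from any library, and I am not sure this is the right direction):

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def is_internal(base_netloc, url):
    # Treat relative URLs and same-host absolute URLs as part of the site.
    netloc = urlparse(url).netloc
    return netloc == "" or netloc == base_netloc


def crawl_for_target(start_url, target="dataone", max_pages=200):
    """Breadth-first crawl of one site, recording every link whose
    resolved href contains `target`. Returns (page, href) pairs."""
    base_netloc = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}   # avoid visiting the same page twice
    hits = []
    fetched = 0
    while queue and fetched < max_pages:
        page = queue.popleft()
        fetched += 1
        try:
            r = requests.get(page, timeout=10)
        except requests.RequestException:
            continue  # skip pages that time out or error
        if "text/html" not in r.headers.get("Content-Type", ""):
            continue  # skip PDFs, images, etc.
        # stdlib parser; swap in "lxml" if it is installed
        soup = BeautifulSoup(r.content, "html.parser")
        for link in soup.find_all("a", href=True):
            href = urljoin(page, link["href"])  # resolve relative links
            if target in href:
                hits.append((page, href))
            if is_internal(base_netloc, href) and href not in seen:
                seen.add(href)
                queue.append(href)
    return hits
```

The idea is that `len(hits)` would answer "how many times?", and the `(page, href)` pairs would record where each reference appears, with `max_pages` as a safety valve since I won't know how big each site is.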