I am writing a web scraper for a school project that catalogs all valid URLs found on a page, and can follow each URL to the next page and repeat the same action, up to a set number of layers.
Quick summary of the code's intent:
- the function takes a BeautifulSoup object, a url (to indicate where it started), the current layer count, and the maximum layer depth
- check the page for all href attributes
- a 'results' list is appended to each time an href tag is found containing a valid URL (one that starts with http, https, HTTP, or HTTPS; I know this may not be the perfect way to check, but it is what I am working with for now)
- each time a valid URL is found, the layer is incremented by 1 and recursiveLinkSearch() is called again
- when the layer count is reached, or no hrefs remain, return the results list
I am very out of practice with recursion, and I am hitting an issue where Python adds a None to the "results" list at the end of the recursion.
This link (https://stackoverflow.com/questions/61691657/python-recursive-function-returns-none) suggests the problem may be where I am exiting my function. I am also not sure the recursion is operating properly, since the recursive call sits inside a for loop.
Any help or insight on a recursion exit strategy is greatly appreciated.
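To show the behavior outside the scraper, here is a minimal, self-contained toy example (names made up for illustration) that reproduces the same trailing None I am seeing:

def demo(layer):
    results = []
    if layer < 1:
        results.append("found")
        # append() stores whatever the recursive call returns, even None
        results.append(demo(layer + 1))
    if results != []:
        return results
    # no explicit return on this path, so Python implicitly returns None

print(demo(0))  # prints: ['found', None]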
My actual code:

import requests
from bs4 import BeautifulSoup

def curlURL(url):
    # fetch the page and beautify with BS
    soup = BeautifulSoup(requests.get(url, timeout=3).text, "html.parser")
    return soup
def recursiveLinkSearch(soup, url, layer, depth):
    results = []
    # for each 'href' found on the page, check if it is a URL
    for a in soup.find_all(href=True):
        try:
            # for every href found, check if it contains http or https
            if any(stringStartsWith in a.get('href')[0:4] for stringStartsWith in ["http", "https", "HTTP", "HTTPS"]) \
                    and a.get('href') != url and layer < depth:
                print(f"Found URL: {a.get('href')}")
                # colors is my own helper class of ANSI escape codes, defined elsewhere
                print(f"LOG: {colors.yellow}Current Layer: {layer}{colors.end}")
                results.append(a.get('href'))
                # BUG: adds an extra "None" type to the end of each list
                results.append(recursiveLinkSearch(curlURL(a.get('href')), a.get('href'), layer+1, depth))
        # Exceptions Stack
        except requests.exceptions.InvalidSchema:
            print(f"{a.get('href')}")
            print(f"{colors.bad}Invalid Url Detected{colors.end}")
        except requests.exceptions.ConnectTimeout:
            print(f"{a.get('href')}")
            print(f"{colors.bad}Connection Timeout. Passing...{colors.end}")
        except requests.exceptions.SSLError:
            print(f"{a.get('href')}")
            print(f"{colors.bad}SSL Certificate Error. Passing...{colors.end}")
        except requests.exceptions.ReadTimeout:
            print(f"{a.get('href')}")
            print(f"{colors.bad}Read Timeout. Passing...{colors.end}")
    # exit recursion
    if results != []:
        print(f"LOG: {results[-1]}")
        return results
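Based on the linked answer, my current guess at a fix is to return results unconditionally and to merge the child list with extend() instead of append(). Here is an untested sketch of what I mean (I also collapsed the except clauses into the base requests.exceptions.RequestException and swapped my prefix check for str.startswith(), purely to keep the sketch short):

def recursiveLinkSearch(soup, url, layer, depth):
    results = []
    for a in soup.find_all(href=True):
        href = a.get('href')
        try:
            # startswith() with a tuple covers http and https in either case
            if href.lower().startswith(("http://", "https://")) and href != url and layer < depth:
                results.append(href)
                # extend() merges the child's list; append() would nest it (or store None)
                results.extend(recursiveLinkSearch(curlURL(href), href, layer + 1, depth))
        except requests.exceptions.RequestException:
            print(f"Skipping {href}")
    # always return the list, even when empty, so the caller never receives None
    return results

Is that the right way to think about the exit, or am I still missing something about how the recursion should unwind?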