
I am writing a web scraper for a school project that catalogs all valid URLs found on a page, and can follow each URL to the next page and repeat the process, up to a set number of layers.

Quick summary of what the code is meant to do:

  1. The function takes a BeautifulSoup object, a URL (to indicate where it started), the current layer count, and the maximum layer depth.
  2. Check the page for all href attributes.
  3. Each time an href containing a valid URL is found, it is appended to the 'results' list (valid meaning it starts with http, https, HTTP, or HTTPS; I know this may not be the perfect check, but for now it's what I am working with).
  4. Each time a valid URL is found, the layer is incremented by 1 and recursiveLinkSearch() is called again.
  5. When the layer count is reached, or no hrefs remain, return the results list.

I am very out of practice with recursion, and I am hitting an issue where Python adds a 'None' to the "results" list at the end of the recursion.

This link [https://stackoverflow.com/questions/61691657/python-recursive-function-returns-none] suggests the problem may be in how I exit my function. I am also not sure the recursion is operating properly, because the recursive call sits inside the for loop.

Any help or insight on recursion exit strategy is greatly appreciated.

def curlURL(url):
    # beautify with BS
    soup = BeautifulSoup(requests.get(url, timeout=3).text, "html.parser")
    return soup


def recursiveLinkSearch(soup, url, layer, depth):
    results = []
    # for each 'href' found on the page, check if it is a URL
    for a in soup.find_all(href=True):
        try:
            # for every href found, check if contains http or https
            if any(stringStartsWith in a.get('href')[0:4] for stringStartsWith in ["http", "https", "HTTP", "HTTPS"]) \
                    and a.get('href') != url and layer < depth:

                print(f"Found URL: {a.get('href')}")
                print(f"LOG: {colors.yellow}Current Layer: {layer}{colors.end}")
                results.append(a.get('href'))
                # BUG: adds an extra "None" type to the end of each list
                results.append(recursiveLinkSearch(curlURL(a.get('href')), a.get('href'), layer+1, depth))
        # Exceptions Stack
        except requests.exceptions.InvalidSchema:
            print(f"{a.get('href')}")
            print(f"{colors.bad}Invalid Url Detected{colors.end}")
        except requests.exceptions.ConnectTimeout:
            print(f"{a.get('href')}")
            print(f"{colors.bad}Connection Timeout. Passing...")
        except requests.exceptions.SSLError:
            print(f"{a.get('href')}")
            print(f"{colors.bad}SSL Certificate Error.  Passing...")
        except requests.exceptions.ReadTimeout:
            print(f"{a.get('href')}")
            print(f"{colors.bad}Read Timeout.  Passing...")
    # exit recursion
    if results != []:
        print(f"LOG: {results[-1]}")
        return results
Gabriel C.
  • If no values are added to the results list, your recursiveLinkSearch function doesn't explicitly return anything, which means it implicitly returns None. Your test for `if results != []` isn't correct. The code needs to return something (or the calling code needs to be prepared to test for None and not append to the results list). Also, be aware of the difference between append and extend with lists. You might need to use extend. – jarmod Aug 23 '21 at 23:23
  • Thank you, I appreciate the insight on extend. The intent of the nested lists was to keep track of where each link was found. As a use case, a user who crawls 3 layers deep might want to know that a tag was found on link 2 of layer 2, or link 4 of layer 3. What I was going for is a multi-dimensional array data structure, but if there is a better way to index this I would love to know. – Gabriel C. Aug 24 '21 at 12:42

1 Answer


This is not a recursion problem. At the end of the function, if results != [], you print something and return results. Otherwise the function just ends without a return statement. In Python, a function that ends without returning a value returns None, so when you append the result of such a call you append None. That is what happens whenever a nested call finishes with an empty results list.
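
To see the behaviour in isolation, here is a minimal sketch with no scraping involved. The function name `maybe_results` and the example URL are made up for illustration; only the shape of the final `if` mirrors the code in the question.

```python
def maybe_results(items):
    results = list(items)
    if results != []:
        return results
    # No return statement on this path, so Python returns None implicitly.


collected = []
collected.append(maybe_results(["https://example.com"]))  # appends a list
collected.append(maybe_results([]))                       # appends None
print(collected)
```

The second call takes the path with no return statement, which is the same thing that happens in the question when a crawled page yields no valid links.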

You can either check what you are about to append, or pop() the last item if it turns out to be None after appending.
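
Here is a rough sketch of the "check what you are appending" option, combined with always returning the list. It is not a drop-in replacement for the code in the question: to keep it runnable without network access, the hypothetical `PAGES` dictionary and `get_links` helper stand in for `curlURL` and the BeautifulSoup href loop, and the function name `recursive_link_search` is my own.

```python
# Hypothetical stand-in for fetching and parsing: maps a URL to the links on that page.
PAGES = {
    "https://a.example": ["https://b.example", "https://c.example"],
    "https://b.example": ["https://d.example"],
    "https://c.example": [],
    "https://d.example": [],
}


def get_links(url):
    """Return the links found on a page (assumed helper, replaces the real request)."""
    return PAGES.get(url, [])


def recursive_link_search(url, layer, depth):
    results = []
    if layer >= depth:
        return results  # depth limit reached: still return a list, never None
    for link in get_links(url):
        if link == url:
            continue  # skip self-links, as in the question
        results.append(link)
        nested = recursive_link_search(link, layer + 1, depth)
        if nested:  # only append non-empty nested results
            results.append(nested)
    return results  # always return the list, even when it is empty


print(recursive_link_search("https://a.example", 0, 2))
# ['https://b.example', ['https://d.example'], 'https://c.example']
```

This keeps the nested layout, where a sub-list follows the link it was found on. If a flat list is enough, `results.extend(nested)` in place of the conditional append gives that instead.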

Rustam A.