
I'm creating a program that uses a dictionary to store large trees of web links in Python. Basically, you start with the root URL, and that creates a dictionary based on the URLs found from the HTML of the root. In the next step, I want to get the pages of each of those URLs and grab the links on those URLs. Eventually, I want to have a dictionary with all the links in it and their relation to each other.

This is what I have for the first two depths:

for link in soup.find_all('a'):
    secondurl = link.get('href')
    url_tree[siteurl][secondurl] = {}
    # Get page source for secondurl and make a new soup
    for link in soup.find_all('a'):
        url = link.get('href')
        url_tree[siteurl][secondurl][url] = {}

This system works, but as you can tell, if I want a dictionary N layers deep, that gets to be a lot of blocks of code. Is there a way to automatically add more layers? Any help is appreciated!

1 Answer


This can be done using recursive functions.

Here's a basic example that will crawl all the urls found on a page one by one, then crawl all the urls found on each of those pages one by one, and so on. It will also print out every url it finds.

def recursive_fetch(url_to_fetch):
    # get page source from url_to_fetch
    # make a new soup
    for link in soup.find_all('a'):
        url = link.get('href')
        print(url)
        # run the function recursively 
        # for the current url
        recursive_fetch(url)


# Usage
recursive_fetch(root_url)
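
The "get page source" and "make a new soup" comments above (and in the fuller code below) are left as placeholders. Here's a minimal sketch of how they could be filled in, assuming you fetch pages with the requests library and parse them with BeautifulSoup; get_soup is just a hypothetical helper name, so adjust it to whatever you're already using:

import requests
from bs4 import BeautifulSoup


def get_soup(url):
    # download the page source and parse it into a soup object
    response = requests.get(url)
    return BeautifulSoup(response.text, 'html.parser')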

Since you want a dict tree of all the urls found, the above code isn't much help on its own, but it's a start.

This is where it gets really complex, because now you'll also need to keep track of the parent of the current url being crawled, the parent of that url, the parent of that one, and so on all the way up the chain.

You see what I mean? It gets very complex, very fast.

Below is the code that does all that. I've written comments in the code to explain it as best I can, but you'll need to actually understand how recursive functions work to follow it fully.

First, let's look at another function which will be very helpful in getting the parent of a url from the tree:

def get_parent(tree, parent_list):
    """How it works:

    Let's say the `tree` looks like this:

        tree = {
            'root-url': {
                'link-1': {
                    'link-1-a': {...}
                }
            }
        }

    and `parent_list` looks like this:

        parent_list = ['root-url', 'link-1', 'link-1-a']

    this function will chain the values in the list and 
    perform a dict lookup like this:

        tree['root-url']['link-1']['link-1-a']
    """

    first, rest = parent_list[0], parent_list[1:]
    try:
        if tree[first] and rest:
            # if tree[first] and rest aren't empty
            # run the function recursively
            return get_parent(tree[first], rest)
        else:
            return tree[first]
    except KeyError:
        # this is required for creating the 
        # root_url dict in the tree
        # because it doesn't exist yet

        tree[first] = {}
        return tree[first]
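
For example, here's a small, hypothetical usage of get_parent on its own (the tree and parent_list values below are made up purely for illustration):

tree = {
    'root-url': {
        'link-1': {
            'link-1-a': {}
        }
    }
}

print(get_parent(tree, ['root-url', 'link-1']))  # {'link-1-a': {}}

# a key that doesn't exist yet is created as an empty dict
print(get_parent(tree, ['new-root']))  # {}
print(tree['new-root'])                # {}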

And the recursive_fetch function will look like this:

url_tree = {} # dict to store the url tree


def recursive_fetch(fetch_url, parents=None):
    """
    `parents` is a list of parents of the current url
     Example:
         parents = ['root-url', 'link-1', ... 'parent-link']
    """
    parents = parents or []

    parents.append(fetch_url)

    # get page source from fetch_url
    # make new soup object

    for link in soup.find_all('a'):
        url = link.get('href')

        if parents:
            parent = get_parent(url_tree, parents)
        else:
            parent = None

        if parent is not None:
            # this will run when parent is not None,
            # i.e. even if parent is an empty dict {}

            # create a new dict of the current url
            # inside the parent dict
            parent[url] = {}
        else:
            # this url has no parent, 
            # insert this directly in the url_tree
            url_tree[url] = {}

        # now crawl the current url
        recursive_fetch(url, parents)

    # Next is the most important block of code.
    # Whenever one level of recursion completes,
    # it will pop the last parent from
    # the `parents` list so that in the
    # next recursion, the parents are correct.
    # Without this block, the url_tree wouldn't
    # look as expected.
    # It took me many hours to figure this out

    try:
        parents.pop(-1)
    except IndexError:
        pass
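
Usage is the same as before: call it with the root url, and the module-level url_tree dict fills up as the crawl proceeds. The shape shown below is only illustrative:

# Usage
recursive_fetch(root_url)

# Afterwards, url_tree is nested dicts along the lines of:
# {
#     'root-url': {
#         'link-1': {
#             'link-1-a': {},
#         },
#         'link-2': {},
#     }
# }

Note that, as written, nothing stops the crawl from revisiting urls it has already seen, so on a real site you'd also want a visited set and/or a depth limit to keep the recursion from running forever.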
– xyres
  • This is exactly what I'm looking for! I had originally thought of using a recursive function for the web pages, but I had no idea how to do that and keep track of the URLs. I saw this [post](https://stackoverflow.com/questions/19937210/dynamically-set-deep-python-dict), but wasn't sure about how to make the leap from that to my program. Thanks for your help! – SheepyBloke Dec 08 '17 at 20:58