This can be done using recursive functions.
Here's a basic example: it will crawl all the urls found on a page one by one, then all the urls found on each of those pages one by one, and so on. It will also print out every url it finds.
import requests
from bs4 import BeautifulSoup

def recursive_fetch(url_to_fetch):
    # get the page source from url_to_fetch and make a
    # new soup (requests is assumed here for fetching;
    # any HTTP client works)
    page = requests.get(url_to_fetch)
    soup = BeautifulSoup(page.text, 'html.parser')
    for link in soup.find_all('a'):
        url = link.get('href')
        if not url:
            continue  # skip <a> tags without an href
        print(url)
        # run the function recursively
        # for the current url
        recursive_fetch(url)

# Usage
recursive_fetch(root_url)
Since you want a dict tree of all the urls found, the above code isn't much help, but it's a start.
This is where it gets really complex, because now you'll also need to keep track of the parent of the current url being crawled, the parent of that url, the parent of that one, and so on. You see what I mean? It gets very complex, very fast.
Below is the code that does all that. I've written comments in the code to explain it as best I can, but you'll need to actually understand how recursive functions work to follow it fully.
First, let's look at another function that will be very helpful for getting the parent of a url from the tree:
def get_parent(tree, parent_list):
    """How it works:
    Let's say the `tree` looks like this:
        tree = {
            'root-url': {
                'link-1': {
                    'link-1-a': {...}
                }
            }
        }
    and `parent_list` looks like this:
        parent_list = ['root-url', 'link-1', 'link-1-a']
    this function will chain the values in the list and
    perform a dict lookup like this:
        tree['root-url']['link-1']['link-1-a']
    """
    first, rest = parent_list[0], parent_list[1:]
    try:
        if tree[first] and rest:
            # tree[first] is non-empty and there are
            # more parents left, so run the function
            # recursively on the subtree
            return get_parent(tree[first], rest)
        else:
            return tree[first]
    except KeyError:
        # this is required for creating the
        # root_url dict in the tree
        # because it doesn't exist yet
        tree[first] = {}
        return tree[first]
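As a quick sanity check of how it behaves (the tree below is made up just for illustration):

tree = {
    'root-url': {
        'link-1': {
            'link-1-a': {}
        }
    }
}

parent = get_parent(tree, ['root-url', 'link-1'])
print(parent)  # {'link-1-a': {}}

# the returned dict is the same object stored in `tree`,
# so mutating it updates the tree in place:
parent['link-2'] = {}

# a missing key is created on the fly and returned:
get_parent(tree, ['brand-new-url'])  # tree now contains 'brand-new-url': {}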
And the recursive_fetch function will look like this:
url_tree = {}  # dict to store the url tree

def recursive_fetch(fetch_url, parents=None):
    """
    `parents` is a list of parents of the current url
    Example:
        parents = ['root-url', 'link-1', ... 'parent-link']
    """
    parents = parents or []
    parents.append(fetch_url)
    # get the page source from fetch_url and make a new
    # soup object (using requests/BeautifulSoup as above)
    page = requests.get(fetch_url)
    soup = BeautifulSoup(page.text, 'html.parser')
    for link in soup.find_all('a'):
        url = link.get('href')
        if not url:
            continue  # skip <a> tags without an href
        if parents:
            parent = get_parent(url_tree, parents)
        else:
            parent = None
        if parent is not None:
            # this will run when parent is not None,
            # i.e. even if parent is an empty dict {}:
            # create a new dict for the current url
            # inside the parent dict
            parent[url] = {}
        else:
            # this url has no parent,
            # insert it directly in the url_tree
            url_tree[url] = {}
        # now crawl the current url
        recursive_fetch(url, parents)
    # Next is the most important block of code.
    # Whenever one recursion completes, it pops the
    # last parent from the `parents` list so that in
    # the next recursion the parents are correct.
    # Without this block, the url_tree wouldn't
    # look as expected.
    # It took me many hours to figure this out.
    try:
        parents.pop(-1)
    except IndexError:
        pass
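Usage stays the same as before. As a sketch (root_url here is just a placeholder for whatever page you want to start from, and the pretty-print is only for inspecting the result):

import json

root_url = 'http://example.com'  # placeholder starting page

recursive_fetch(root_url)

# pretty-print the finished tree of urls
print(json.dumps(url_tree, indent=4))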