tl;dr: your seed value is not a list, and tocrawl shouldn't be one either (it should instead be a data structure that supports unions, such as a set)
This is your basic graph traversal algorithm. It's hard to explain the importance of the one variable without understanding its context, so allow me to over-indulge you. >:)
The web is what mathematicians like to call a graph. Each address on the web is a node in the graph and a link (which joins two addresses) is called an edge. So "crawling the web" is just finding paths along link-edges to different address-nodes in the web-graph.
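To make the graph idea concrete, here's a toy "web" as a Python dict mapping each address-node to the addresses it links to (the names are made up, purely for illustration):

```python
# A toy web-graph: each address maps to the list of addresses it links to.
# The names are hypothetical, just to illustrate nodes and edges.
toy_web = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com"],
    "c.com": ["a.com"],  # an edge back to the start: a cycle
}
```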
def crawl_web(seed):
At every node you're going to have to check for new nodes you can reach from the current one. But you have to start somewhere; here the code says to start at node seed by initializing the list of known, unvisited nodes to contain just seed:
tocrawl = [seed]
You also want to keep track of visited nodes so you don't keep going around in circles:
crawled = []
Then you have to begin your traversal. Keep crawling until there are no more nodes left to crawl:
while tocrawl:
Each iteration of the loop visits another node, which we get from our list (initially just the seed node) by popping a single value off the end:
page = tocrawl.pop()
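Just to be concrete about what pop() does here (the addresses are made up): with no argument it removes and returns the *last* element, which makes this traversal depth-first-ish:

```python
tocrawl = ["a.com", "b.com", "c.com"]
page = tocrawl.pop()  # removes and returns the last element
# page is now "c.com"; tocrawl is now ["a.com", "b.com"]
```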
Don't visit previously crawled nodes; just continue:
if page not in crawled:
You can't take the union of two lists directly in Python; you'd have to convert them to sets to do something like this:
# union(tocrawl, get_all_links(get_page(page)))
but the gist of it is that you're collecting all the edges from the current node and adding their nodes to the list (set?) of known, uncrawled addresses. You could define a list union function the following way, noting that it doesn't preserve order:
def list_union(a, b):
# "|" is the set union operator
return list(set(a) | set(b))
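For example, the set-union trick behaves like this (address names are hypothetical; note the duplicate "b.com" is collapsed and the result's order is arbitrary):

```python
a = ["a.com", "b.com"]
b = ["b.com", "c.com"]
# "|" is the set union operator; converting back to list discards order
merged = list(set(a) | set(b))
```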
Lastly, just remember that you've visited the current address, so you don't end up clicking around in link circles forever (cycles is the technical term; a loop if a node has an edge to itself):
crawled.append(page)
And when you're all done, return the full list of addresses reachable from your starting address by following links (the ones you've been to).
return crawled
Now let's see the whole thing in context:
def crawl_web(seed):
    tocrawl = [seed]
    crawled = []
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            # list_union returns a new list, so the result must be
            # assigned back -- calling it only for its side effect does nothing
            tocrawl = list_union(tocrawl, get_all_links(get_page(page)))
            crawled.append(page)
    return crawled
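And since the tl;dr recommends a set, here's a sketch of the same traversal with sets instead of lists. Note the assumptions: get_page and get_all_links are passed in as parameters only so the sketch is self-contained (in the course code they're separate functions), and the toy web below is invented for the demo:

```python
def crawl_web_sets(seed, get_page, get_all_links):
    # get_page/get_all_links are parameters here just so this runs standalone
    tocrawl = {seed}          # a set supports union natively
    crawled = set()
    while tocrawl:
        page = tocrawl.pop()  # removes an arbitrary element
        if page not in crawled:
            # |= is in-place set union: no helper function needed
            tocrawl |= set(get_all_links(get_page(page)))
            crawled.add(page)
    return crawled

# Demo on a toy web-graph (hypothetical addresses):
toy_web = {"a.com": ["b.com", "c.com"], "b.com": ["c.com"], "c.com": ["a.com"]}
result = crawl_web_sets("a.com",
                        lambda addr: addr,                  # "fetching" a page is a no-op here
                        lambda page: toy_web.get(page, []))  # links come from the dict
```

The set version also makes the `if page not in crawled` membership test O(1) instead of a linear scan of a list.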