
What's the purpose of the second line? Don't we have to pass in a list as our seed parameter anyway? I'd think you could just use seed in all the areas we have our tocrawl variable, instead of having a list inside a list.

def crawl_web(seed):
    tocrawl = [seed]
    crawled = []
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            union(tocrawl, get_all_links(get_page(page)))
            crawled.append(page)
    return crawled

EDIT: full script - http://codepad.org/qxuzVkof

ZCJ
  • That depends. What does `union()` do? – Greg Hewgill Mar 13 '12 at 23:58
  • whoops- `def union(p,q): for e in q: if e not in p: p.append(e)` – ZCJ Mar 14 '12 at 00:03
  • http://stackoverflow.com/questions/480214/how-do-you-remove-duplicates-from-a-list-in-python-whilst-preserving-order Check out this question if you want to see a high-performance solution to your list union operation, which I'd imagine would be useful considering the fact that the internet is huge and your above algorithm is O(n*m). – machine yearning Mar 14 '12 at 02:36
  • Aha, by the way, since `pop()` takes from the end of the list, your algorithm is a "depth-first search": it always follows the most recently discovered address first. – machine yearning Mar 14 '12 at 02:38

3 Answers


seed is not a list, but a single item, most likely the URL of the first page to crawl.

While one could change the design of the function to accept multiple seed URLs, that would complicate its signature - almost every caller just wants to crawl from one URL. It is also bad style for a function to modify the caller's variables; if the function took a tocrawl parameter directly, you'd have to document (and get every caller to know) that:

  • tocrawl represents all the URLs that serve as initial starting points
  • tocrawl must be a list (or a compatible type); an ordinary iterable does not support pop. It must also conform to the properties that union expects. This is severely unpythonic.
  • If the function returns normally, the argument list will be empty, which means almost any caller would want to pass in a copy of the list.
  • If the function exits with an exception, tocrawl will contain a snapshot of URLs yet to crawl.

Since the last two points essentially mean that the input parameter will get corrupted, a better design would be to make the argument an arbitrary iterable, and then construct a list. That way, the caller doesn't have to worry about all these silly conventions.
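A sketch of that design might look like the following; note that `union` is the list-based helper from the question's comments, while the `LINKS` dict and the two stub functions are invented stand-ins so the example runs on its own (the real `get_page`/`get_all_links` live in the linked script):

```python
def union(p, q):
    # list-based union from the question's comments
    for e in q:
        if e not in p:
            p.append(e)

def crawl_web(seeds):
    # copy the caller's iterable into a private list, so the caller's
    # data is never emptied or otherwise mutated by the crawl
    tocrawl = list(seeds)
    crawled = []
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            union(tocrawl, get_all_links(get_page(page)))
            crawled.append(page)
    return crawled

# invented stand-ins for the real fetching code
LINKS = {'a': ['b', 'c'], 'b': ['c'], 'c': []}

def get_page(url):
    return url                  # pretend the page content is its URL

def get_all_links(page):
    return LINKS.get(page, [])
```

A single start page is then `crawl_web(['a'])`, and `crawl_web(['a', 'b'])` starts from two; either way the caller's own list survives the call.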

One more thing: Instead of writing your own union implementation and using a list to store all crawled websites, you should consider using a set in both instances, like this:

def crawl_web(seed):
    tocrawl = set([seed])
    crawled = set()
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            tocrawl.update(get_all_links(get_page(page)))
            crawled.add(page)
    return crawled
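To see the set version in action, here it is run against a tiny hard-coded link graph; the `LINKS` dict and the two stub functions are invented for illustration, standing in for the real `get_page`/`get_all_links`:

```python
def crawl_web(seed):
    tocrawl = set([seed])
    crawled = set()
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            tocrawl.update(get_all_links(get_page(page)))
            crawled.add(page)
    return crawled

# invented stand-ins for the real network code
LINKS = {
    'http://a': ['http://b', 'http://c'],
    'http://b': ['http://a'],
    'http://c': [],
}

def get_page(url):
    return url

def get_all_links(page):
    return LINKS.get(page, [])

print(crawl_web('http://a'))   # a set of all three URLs, in no particular order
```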
phihag
  • Using a set to `tocrawl` means you crawl things in an arbitrary order. The `union` function given appends new links to the end of the list, with the result that the order in which pages are crawled is predictable (and any given link reached through an arbitrary number of links from the seed *will* eventually get crawled, even if the number of reachable pages is unbounded, although that's likely not a practical benefit). I don't know whether that makes a difference or not to the OP. Likewise `crawled` looks much better as a set, but that means you don't get the pages returned in crawl order. – Ben Mar 14 '12 at 01:36

tl;dr: your seed value is not a list, and tocrawl shouldn't be one either (it should instead be a data structure that supports unions, such as a set)

This is your basic graph traversal algorithm. It's hard to explain the importance of the one variable without understanding its context, so allow me to over-indulge you. >:)

The web is what mathematicians like to call a graph. Each address on the web is a node in the graph and a link (which joins two addresses) is called an edge. So "crawling the web" is just finding paths along link-edges to different address-nodes in the web-graph.
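That graph view can be made concrete with a plain adjacency mapping; the page names here are made up:

```python
# a toy web-graph: each key is an address-node, each value lists the
# address-nodes its link-edges point to
web = {
    'index.html': ['about.html', 'blog.html'],
    'about.html': ['index.html'],
    'blog.html':  ['index.html', 'post1.html'],
    'post1.html': [],
}

# the edges leaving one node:
print(web['blog.html'])   # ['index.html', 'post1.html']
```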

def crawl_web(seed):

At every node you're going to have to check for new nodes to visit which you can reach from the current one. But you have to start somewhere; here the code says to start at node seed by initializing the list of known, unvisited nodes to seed:

    tocrawl = [seed]

You also want to keep track of visited nodes so you don't keep going around in circles:

    crawled = []

Then you have to begin your traversal. Keep crawling until there are no more nodes to crawl:

    while tocrawl:

Each iteration of the loop will visit another node, which we get from our list (initially just the seed node) by popping a single value off the end:

        page = tocrawl.pop()

Don't visit previously crawled nodes; just continue:

        if page not in crawled:

I don't think you can do unions directly on lists in Python; you'd probably have to use a set to do something like this:

            # union(tocrawl, get_all_links(get_page(page)))

but the gist of it is that you're collecting all the edges from the current node and adding their nodes to the list (set?) of known, uncrawled addresses. You could define a list union function the following way, noting that it doesn't preserve order:

def list_union(a, b):
    # "|" is the set union operator
    return list(set(a) | set(b))
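For example (the `ordered_union` variant below is not from the answer; it's one way to keep order, in the spirit of the order-preserving solutions linked in the comments on the question):

```python
def list_union(a, b):
    # "|" is the set union operator; order is NOT preserved
    return list(set(a) | set(b))

def ordered_union(a, b):
    # order-preserving alternative: keep first occurrences,
    # tracking already-seen items in a set for fast membership tests
    seen = set()
    result = []
    for e in a + b:
        if e not in seen:
            seen.add(e)
            result.append(e)
    return result

print(sorted(list_union([3, 1], [2, 1])))   # [1, 2, 3] once sorted
print(ordered_union([3, 1], [2, 1]))        # [3, 1, 2]
```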

Lastly, remember to record that you've visited the current address, so you don't end up clicking around in link circles forever (cycles is the technical term; loops if a node has an edge to itself):

            crawled.append(page)

And when you're all done, return the full list of addresses reachable from your starting address by following links (the ones you've been to).

    return crawled

Now let's see the whole thing in context:

def crawl_web(seed):
    tocrawl = [seed]
    crawled = []
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            # list_union returns a new list, so reassign tocrawl
            tocrawl = list_union(tocrawl, get_all_links(get_page(page)))
            crawled.append(page)
    return crawled
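As a sanity check, here is that function run end-to-end against a stubbed-out three-page site (the `LINKS` dict and stub functions are invented for illustration). One thing to watch: since `list_union` returns a new list rather than mutating its argument, its result must be assigned back to `tocrawl`:

```python
def list_union(a, b):
    return list(set(a) | set(b))

def crawl_web(seed):
    tocrawl = [seed]
    crawled = []
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            # list_union builds a new list, so reassign it
            tocrawl = list_union(tocrawl, get_all_links(get_page(page)))
            crawled.append(page)
    return crawled

# invented stand-ins for the real fetching code
LINKS = {'home': ['faq', 'contact'], 'faq': ['home'], 'contact': []}

def get_page(url):
    return url

def get_all_links(page):
    return LINKS.get(page, [])

print(sorted(crawl_web('home')))   # ['contact', 'faq', 'home']
```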
machine yearning
  • You say both that `tocrawl` isn't a list, and "here the code says to start at node seed by initializing the **list** of known, unvisited nodes to seed" (emphasis mine). `tocrawl` is most definitely a list. But +1 for the clear explanation of the algorithm. – Ben Mar 14 '12 at 01:45
  • Ahh, what I meant by that is that whatever he was using for `tocrawl` needed to be taken as a parameter to a union `operation`, which is not true of Python lists. Sorry for the confusion, fixing. – machine yearning Mar 14 '12 at 02:29
  • I think the script the OP is using has `union` defined as a function operating on lists (code given in a comment on the main question). Defining a union function for lists rather than using a set may well be deliberate, since it preserves ordering that would otherwise be thrown away. The primitive operations of lists **do** support defining a union operation externally; it doesn't matter that Python `list` doesn't come with a union method built in. So I think even saying it `tocrawl` shouldn't be a list is a bit strong without knowing that the OP doesn't care about the crawl order. – Ben Mar 14 '12 at 04:25
  • Yeah, he only just posted that. I posted a link to some efficient order-preserving list unions as a comment on the original post, see above. – machine yearning Mar 14 '12 at 05:55

In this algorithm, seed is a page, not a list.

tocrawl is a stack where you collect linked pages you want to visit in order to find more links.
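The stack behaviour comes from `list.pop()`, which with no argument removes from the end of the list, so the most recently found page gets visited next. A minimal illustration (the page names are made up):

```python
tocrawl = ['seed-page']
tocrawl.append('link-found-first')
tocrawl.append('link-found-second')

# pop() takes from the END of the list: last in, first out
print(tocrawl.pop())   # link-found-second
print(tocrawl.pop())   # link-found-first
```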

Carles Barrobés
  • Yes, right! It is the first page where the crawling will start; for this reason it is named "seed". – Thanasis Petsas Mar 14 '12 at 00:00
  • How do you just assign a variable to a page inside a list? Here it looks like we're assigning whatever is in seed to a list- wouldn't that just make the list `tocrawl` have a single item, which is the string of source code from `seed` page? – ZCJ Mar 14 '12 at 00:10
  • @ZCJ The `tocrawl` list is *initialised* as a list containing the single item `seed`, but inside the loop you keep adding all the links found on the page to the `tocrawl` list. There would be no point in the loop head `while tocrawl` if `tocrawl` weren't being updated during the loop body. – Ben Mar 14 '12 at 01:39