
I am trying to parallelize scraping a website using BeautifulSoup in Python.

I give a URL and a depth variable to a function, and it looks something like this:

def recursive_crawl(url, depth):
    if depth == 0:
        return 

    links = fetch_links(url) # where I use the requests library

    print("Level : ", depth, url)
    for link in links:
        if is_link_valid(link):
            recursive_crawl(link, depth - 1)
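
For context, fetch_links is essentially requests plus BeautifulSoup and looks roughly like this (simplified, no error handling):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin


def fetch_links(url):
    """Download a page and return the absolute URLs of its <a href> targets."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]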

Part of the output looks like this:

Level :  4 site.com
Level :  3 site.com/
Level :  2 site.com/research/
Level :  1 site.com/blog/
Level :  1 site.com/stories/
Level :  1 site.com/list-100/
Level :  1 site.com/cdn-cgi/l/email-protection

and so on.

My problem is this:

I am using a set() to avoid going to already visited links, so I have a shared-memory problem. Can I implement this recursive web crawler in parallel?

Note: Please don't recommend scrapy; I want it done with a parsing library like BeautifulSoup.

Marios Ath
  • Does this answer your question? [Links of a page and links of that subpages. Recursion/Threads](https://stackoverflow.com/questions/53402464/links-of-a-page-and-links-of-that-subpages-recursion-threads) – ggorlen Oct 02 '20 at 18:08

1 Answer


I don't think the parsing library you use matters very much. It seems that what you're asking about is how to manage the threads. Here's my general approach (a rough code sketch follows the list):

  1. Establish a shared queue of URLs to be visited. Although the main contents are URL strings, you probably want to wrap those with some supplemental information: depth, referring links, and any other contextual information your spider's going to want.
  2. Build a gatekeeper object which maintains the set of already-visited URLs (the one you've mentioned). The object has a method which takes a URL and decides whether to add it to the Queue in #1. (Submitted URLs are also added to the set. You might also strip the URLs of GET parameters before adding them.) During setup / instantiation, it might take parameters which limit the crawl for other reasons. Depending on the variety of Queue you've chosen, you might also do some prioritization in this object. Since you're using this gatekeeper to wrap the Queue's input, you probably also want to wrap the Queue's output.
  3. Launch several worker Threads. You'll want to make sure each Thread is instantiated with a parameter referencing the single gatekeeper instance. That Thread contains a loop with all your page-scraping code. Each iteration of the loop consumes one URL from the gatekeeper Queue, and submits any discovered URLs to the gatekeeper. (The thread doesn't care whether the URLs are already in the queue. That responsibility belongs to the gatekeeper.)
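
Here's a rough sketch of that structure using threading, queue.Queue, requests, and BeautifulSoup. The Gatekeeper class, the worker function, the thread count, and the timeouts are illustrative choices, not a finished crawler:

import queue
import threading
from urllib.parse import urldefrag, urljoin

import requests
from bs4 import BeautifulSoup


class Gatekeeper:
    """Wraps the shared work queue and the visited-URL set behind a lock."""

    def __init__(self):
        self.todo = queue.Queue()      # items are (url, remaining_depth) tuples
        self.seen = set()              # URLs already submitted
        self.lock = threading.Lock()

    def submit(self, url, depth):
        """Queue a URL unless it's already been seen or the depth is exhausted."""
        if depth <= 0:
            return
        url, _ = urldefrag(url)        # strip #fragments before deduplicating
        with self.lock:
            if url in self.seen:
                return
            self.seen.add(url)
        self.todo.put((url, depth))

    def get(self):
        """Hand one (url, depth) item to a worker; raises queue.Empty when idle."""
        return self.todo.get(timeout=5)


def worker(gatekeeper):
    while True:
        try:
            url, depth = gatekeeper.get()
        except queue.Empty:
            return                     # queue has been idle for a while; stop
        print("Level : ", depth, url)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        soup = BeautifulSoup(html, "html.parser")
        for a in soup.find_all("a", href=True):
            gatekeeper.submit(urljoin(url, a["href"]), depth - 1)


if __name__ == "__main__":
    gk = Gatekeeper()
    gk.submit("https://site.com", 4)   # same depth convention as your recursive version
    threads = [threading.Thread(target=worker, args=(gk,)) for _ in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

Shutdown here is deliberately naive: a worker exits after the queue has been empty for five seconds, which can end the crawl early if one slow fetch is still in flight. A sturdier version would track in-flight work with Queue.task_done() / Queue.join() or a sentinel value, but the shape of the solution is the same.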
Sarah Messer