I am trying to parallelize scraping a website using BeautifulSoup in Python.
I pass a URL and a depth variable to a function that looks something like this:
def recursive_crawl(url, depth):
    if depth == 0:
        return
    links = fetch_links(url)  # where I use the requests library (sketch below)
    print("Level : ", depth, url)
    for link in links:
        if is_link_valid(link):
            recursive_crawl(link, depth - 1)
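For context, fetch_links is essentially just requests plus BeautifulSoup, roughly like this (simplified; the urljoin call is only there to resolve relative hrefs):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def fetch_links(url):
    # Download the page and parse out every <a href="..."> on it
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]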
Part of the output looks like this:
Level : 4 site.com
Level : 3 site.com/
Level : 2 site.com/research/
Level : 1 site.com/blog/
Level : 1 site.com/stories/
Level : 1 site.com/list-100/
Level : 1 site.com/cdn-cgi/l/email-protection
and so on.
My problem is this: I use a set() to avoid revisiting links I have already crawled, so as soon as the crawl runs in parallel that set becomes shared memory. Can I implement this recursive web crawler in parallel?
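To make the question concrete, here is a minimal sketch of the kind of thing I mean: keep the recursion, start a thread per valid link, and put a threading.Lock around the shared visited set (fetch_links and is_link_valid are the same helpers as above). I am not sure this is a sound approach, since the number of threads is unbounded:

import threading

visited = set()
visited_lock = threading.Lock()

def recursive_crawl(url, depth):
    if depth == 0:
        return
    with visited_lock:            # guard the shared set
        if url in visited:
            return
        visited.add(url)
    links = fetch_links(url)
    print("Level : ", depth, url)
    threads = []
    for link in links:
        if is_link_valid(link):
            t = threading.Thread(target=recursive_crawl, args=(link, depth - 1))
            t.start()
            threads.append(t)
    for t in threads:
        t.join()                  # wait for the whole subtree before returning

recursive_crawl("https://site.com", 4)

Would something like this work, or is there a cleaner way to handle the shared set?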
Note: Please don't recommend scrapy; I want it done with a parsing library like BeautifulSoup.