
I am making a web crawler for which I am using the following two functions:

# Each queued link is the new job
def create_jobs():
    for link in file_to_set(QUEUE_FILE):
        queue.put(link)
    queue.join()
    crawl()

# Check if there are items in the queue, then solve them
def crawl():
    queued_links = file_to_set(QUEUE_FILE)
    if len(queued_links) > 0:
        print(str(len(queued_links)) + ' links in the queue')
        create_jobs()

Here crawl is called first. Sometimes while crawling it shows "maximum recursion depth exceeded" and sometimes it does not, even though I am running the same script again. Can someone explain to me what the problem is?

Please note that the number of links I need to crawl is only around 100, which is less than Python's recursion limit.

Kevin Pandya

2 Answers


Your function crawl calls create_jobs, which in turn calls crawl again. So if your stop condition (len(queued_links) == 0) is never met, you may enter an infinite loop, or reach the Python recursion limit.
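For illustration, here is a minimal sketch of the same logic written as a loop instead of mutual recursion, so no recursion depth is consumed at all; file_to_set, QUEUE_FILE and queue are assumed to be the objects from the question:

def crawl():
    # file_to_set, QUEUE_FILE and queue are assumed to exist as in the question
    while True:
        queued_links = file_to_set(QUEUE_FILE)
        if len(queued_links) == 0:
            break
        print(str(len(queued_links)) + ' links in the queue')
        for link in queued_links:
            queue.put(link)
        queue.join()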

loutre

In create_jobs you are calling crawl, which would be fine on its own. But since you are also calling create_jobs from crawl, the two functions recurse into each other. If you did not have the condition len(queued_links) > 0, this would be an infinite loop. To prevent such problems (i.e. to avoid a stack overflow), Python has a recursion limit (see: What is the maximum recursion depth in Python, and how to increase it?).

The thing here is that a webpage is quite likely to contain links to other webpages, so your stop condition will rarely be met. That's why you are hitting the recursion limit. You can increase this limit by doing the following (snippet taken from here: Python: Maximum recursion depth exceeded), but I would not advise you to do that:

import sys
sys.setrecursionlimit(10000) # 10000 is an example, try with different values

A better approach to this problem would be to change the design of your algorithm to something like this (basically you iterate over a list that you populate while crawling, instead of making recursive calls):

def crawl(url):
    # dummy stand-in: pretend each page links to two more pages
    return [url + 'a', url + 'b']

links = ['foo', 'bar']
for link in links:
    links.extend(crawl(link))

Regarding the fact that your algorithm sometimes works and sometimes does not: pages are quite likely to change over time, so if you are really close to the recursion limit, whether you hit it or not may depend on which pages were served on that particular run.

Finally, having only 100 links does not mean you cannot hit a recursion limit of, say, 1000. Your crawl function also calls other functions, and so on, so some of the recursion is hidden.
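As a small self-contained illustration (crawl_demo and create_jobs_demo are hypothetical stand-ins, not the functions from the question), two functions that call each other consume two stack frames per round, so the default limit is reached after roughly half as many rounds:

import sys

rounds = 0

def crawl_demo():
    global rounds
    rounds += 1
    create_jobs_demo()  # like crawl() -> create_jobs()

def create_jobs_demo():
    crawl_demo()        # like create_jobs() -> crawl()

try:
    crawl_demo()
except RecursionError:
    # prints roughly half of sys.getrecursionlimit(), because each round
    # of crawl_demo -> create_jobs_demo costs two stack frames
    print(rounds, 'rounds before hitting the limit of', sys.getrecursionlimit())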

Pholochtairze
  • Manipulating the `links` variable inside a loop which is iterating over it is not good form. A more tractable approach would be `while links:` and then remove an item from the head of the list when you start to process it, and add new items to the end. Abstracting this to a proper queue would improve your program, and possibly make it distributable. – tripleee Mar 14 '16 at 09:40
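A minimal, self-contained sketch of the approach that comment describes, using collections.deque as the work queue; get_links is a hypothetical placeholder for whatever extracts links from a fetched page:

from collections import deque

def get_links(url):
    # placeholder: a real crawler would fetch `url` and return the links on it
    return [url + '/a', url + '/b'] if len(url) < 20 else []

def crawl_all(seed_urls):
    queue = deque(seed_urls)   # links still to visit
    seen = set(seed_urls)      # avoid visiting the same URL twice
    while queue:
        url = queue.popleft()          # take work from the head
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)     # add new work to the tail
    return seen

print(len(crawl_all(['foo', 'bar'])))

Because visited URLs are tracked in a set, the loop terminates even when pages link back to each other, and the deque could later be swapped for a proper task queue if the crawler is distributed.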