
I am scraping a large list of URLs (around 1000), and after some time the crawler gets stuck, crawling 0 pages/min. The problem always occurs at the same spot in the crawl. The list of URLs is retrieved from a MySQL database. I am fairly new to Python and Scrapy, so I don't know where to start debugging, and I fear that due to my inexperience the code itself is also a bit of a mess. Any pointers to where the issue lies are appreciated.

I used to retrieve the entire list of URLs in one go, and the crawler worked fine. However, I had problems writing the results back into the database, and I didn't want to read the whole list of URLs into memory, so I changed the spider to iterate through the database one URL at a time, which is when the problem appeared. I am fairly certain the URL itself isn't the issue, because when I start the crawl from the problem URL it works without issue, only to get stuck further down the line at a different, but again consistent, spot.

The relevant parts of the code are as follows. Note that the script is supposed to be run as a standalone script, which is why I define the necessary settings in the spider itself.

# relevant imports; the MySQL connection `db`, the shared `cursor`, the
# counters `i` and `n_urls`, and the `myItem` item class are defined
# elsewhere in the script
from scrapy import Request
from scrapy.spiders import CrawlSpider


class MySpider(CrawlSpider):
    name = "mySpider"
    item = []
    #spider settings
    custom_settings = {
        'CONCURRENT_REQUESTS': 1,
        'DEPTH_LIMIT': 1,
        'DNS_TIMEOUT': 5,
        'DOWNLOAD_TIMEOUT':5,
        'RETRY_ENABLED': False,
        'REDIRECT_MAX_TIMES': 1
    }


    def start_requests(self):
        # i and n_urls are module-level globals; i is advanced in the callbacks
        while i < n_urls:
            urllist = "SELECT url FROM database WHERE id=" + str(i)
            cursor = db.cursor()
            cursor.execute(urllist)
            urls = cursor.fetchall()
            urls = [row[0] for row in urls]  # fetch url from inside list of tuples
            urls = str(urls[0])              # transform url into a plain string
            yield Request(urls, callback=self.parse, errback=self.errback)

    def errback(self, failure):
        # mark the row as errored and advance to the next id
        global i
        sql = "UPDATE db SET item = %s, scrape_time = now() WHERE id = %s"
        val = ('Error', str(i))
        cursor.execute(sql, val)
        db.commit()
        i += 1


    def parse(self, response):
        global i
        item = myItem()
        item["result"] = response.xpath("//item to search")
        if item["result"] is None or len(item["result"]) == 0:
            sql = "UPDATE db SET item = %s, scrape_time = now() WHERE id = %s"
            val = ('None', str(i))
            cursor.execute(sql, val)
            db.commit()
            i += 1
        else:
            sql = "UPDATE db SET item = %s, scrape_time = now() WHERE id = %s"
            val = ('Item', str(i))
            cursor.execute(sql, val)
            db.commit()
            i += 1
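
For completeness, the spider is meant to be launched as a standalone script; below is a minimal sketch of how such a spider can be run with Scrapy's CrawlerProcess (the actual run code is not shown above, so treat the details as an assumption about the setup):

# sketch only: assumes the MySpider class defined above
from scrapy.crawler import CrawlerProcess

if __name__ == "__main__":
    process = CrawlerProcess()   # per-spider custom_settings still apply
    process.crawl(MySpider)
    process.start()              # blocks until the crawl finishes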

The scraper gets stuck showing the following message:

2019-01-14 15:10:43 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET someUrl> from <GET anotherUrl>
2019-01-14 15:11:08 [scrapy.extensions.logstats] INFO: Crawled 9 pages (at 9 pages/min), scraped 0 items (at 0 items/min)
2019-01-14 15:12:08 [scrapy.extensions.logstats] INFO: Crawled 9 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-14 15:13:08 [scrapy.extensions.logstats] INFO: Crawled 9 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-14 15:14:08 [scrapy.extensions.logstats] INFO: Crawled 9 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-14 15:15:08 [scrapy.extensions.logstats] INFO: Crawled 9 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-14 15:16:08 [scrapy.extensions.logstats] INFO: Crawled 9 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

Everything works fine up until this point. Any help you could give me is appreciated!

I. K.
  • Did you notice your db after running the crawler? – ThunderMind Jan 14 '19 at 16:14
  • Everything that was crawled was put into the database correctly. After the crawling stops, the updates stop as well. – I. K. Jan 17 '19 at 08:36
  • Yes, of course you need to save changes to the db when the spider closes. `https://stackoverflow.com/questions/12394184/scrapy-call-a-function-when-a-spider-quits` Here you can find how to add a `spider_closed` function – ThunderMind Jan 17 '19 at 09:33
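
A minimal sketch of what that last comment points to, assuming the module-level `db` connection from the question: Scrapy calls a spider's closed() method when the spider finishes (a shortcut for the spider_closed signal), which is a convenient place to commit and close the connection.

from scrapy.spiders import CrawlSpider

class MySpider(CrawlSpider):
    name = "mySpider"

    def closed(self, reason):
        # called automatically when the spider closes (spider_closed signal);
        # `db` is the module-level MySQL connection from the question
        db.commit()  # flush any pending updates
        db.close()   # release the MySQL connection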

2 Answers


The reason Scrapy says 0 items is that it only counts yielded items, and you are not yielding anything, only inserting into your database.
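
A minimal sketch of what that could look like in your parse method (names taken from the question; the database update stays, the item is just yielded as well):

    def parse(self, response):
        global i
        item = myItem()
        item["result"] = response.xpath("//item to search")
        sql = "UPDATE db SET item = %s, scrape_time = now() WHERE id = %s"
        cursor.execute(sql, ('Item' if item["result"] else 'None', str(i)))
        db.commit()
        i += 1
        yield item  # this is what the "scraped N items" counter measures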

ThunderMind
  • That explains the data, which is not a problem, but why it stops crawling still escapes me - do you have any idea what could cause it? – I. K. Jan 17 '19 at 08:35

I just had this happen to me, so I wanted to share what caused the bug, in case someone encounters the exact same issue.

Apparently, if you don't specify a callback for a Request, it defaults to the spider's parse method as a callback (my intention was to not have a callback at all for those requests).

In my spider, I used the parse method to make most of the Requests, so this behavior caused many unnecessary requests that eventually led to Scrapy crashing. Simply adding an empty callback function (lambda a: None) for those requests solved my issue.
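
For illustration, a minimal sketch of that workaround (the helper name is just for this example):

from scrapy import Request

def request_without_parsing(url):
    # the explicit no-op callback keeps Scrapy from falling back to the
    # spider's parse() method for this request
    return Request(url, callback=lambda a: None)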

Royar