
I am scraping a large list of URLs (around 1000), and after some time the crawler gets stuck, crawling 0 pages/min. The problem always occurs at the same spot in the crawl. The list of URLs is retrieved from a MySQL database. I am fairly new to Python and Scrapy, so I don't know where to start debugging, and I fear that due to my inexperience the code itself is also a bit of a mess. Any pointers to where the issue lies are appreciated.

I used to retrieve the entire list of URLs in one go, and the crawler worked fine. However, I had problems writing the results back into the database, and I didn't want to read the whole list of URLs into memory, so I changed the spider to iterate through the database one URL at a time, which is when the problem appeared. I am fairly certain the URL itself isn't the issue, because when I start the crawl from the problem URL it works without issue, only to get stuck further down the line at a different, but again consistent, spot.

The relevant parts of the code are as follows. Note that the script is supposed to be run as a standalone script, which is why I define the necessary settings in the spider itself.

# relevant imports; the MySQL connection `db`, the shared `cursor`, the
# counters `i` and `n_urls`, and the `myItem` item class are defined
# elsewhere in the script
from scrapy import Request
from scrapy.spiders import CrawlSpider


class MySpider(CrawlSpider):
    name = "mySpider"
    item = []
    #spider settings
    custom_settings = {
        'CONCURRENT_REQUESTS': 1,
        'DEPTH_LIMIT': 1,
        'DNS_TIMEOUT': 5,
        'DOWNLOAD_TIMEOUT':5,
        'RETRY_ENABLED': False,
        'REDIRECT_MAX_TIMES': 1
    }


    def start_requests(self):
        # i and n_urls are module-level globals; i is advanced in the callbacks
        while i < n_urls:
            urllist = "SELECT url FROM database WHERE id=" + str(i)
            cursor = db.cursor()
            cursor.execute(urllist)
            urls = cursor.fetchall()
            urls = [row[0] for row in urls]  # fetch url from inside list of tuples
            urls = str(urls[0])              # transform url into a plain string
            yield Request(urls, callback=self.parse, errback=self.errback)

    def errback(self, failure):
        # mark the row as errored and advance to the next id
        global i
        sql = "UPDATE db SET item = %s, scrape_time = now() WHERE id = %s"
        val = ('Error', str(i))
        cursor.execute(sql, val)
        db.commit()
        i += 1


    def parse(self, response):
        global i
        item = myItem()
        item["result"] = response.xpath("//item to search")
        if item["result"] is None or len(item["result"]) == 0:
            sql = "UPDATE db SET item = %s, scrape_time = now() WHERE id = %s"
            val = ('None', str(i))
            cursor.execute(sql, val)
            db.commit()
            i += 1
        else:
            sql = "UPDATE db SET item = %s, scrape_time = now() WHERE id = %s"
            val = ('Item', str(i))
            cursor.execute(sql, val)
            db.commit()
            i += 1
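
For completeness, the spider is meant to be launched as a standalone script; below is a minimal sketch of how such a spider can be run with Scrapy's CrawlerProcess (the actual run code is not shown above, so treat the details as an assumption about the setup):

# sketch only: assumes the MySpider class defined above
from scrapy.crawler import CrawlerProcess

if __name__ == "__main__":
    process = CrawlerProcess()   # per-spider custom_settings still apply
    process.crawl(MySpider)
    process.start()              # blocks until the crawl finishes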

The scraper gets stuck showing the following message:

2019-01-14 15:10:43 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET someUrl> from <GET anotherUrl>
2019-01-14 15:11:08 [scrapy.extensions.logstats] INFO: Crawled 9 pages (at 9 pages/min), scraped 0 items (at 0 items/min)
2019-01-14 15:12:08 [scrapy.extensions.logstats] INFO: Crawled 9 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-14 15:13:08 [scrapy.extensions.logstats] INFO: Crawled 9 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-14 15:14:08 [scrapy.extensions.logstats] INFO: Crawled 9 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-14 15:15:08 [scrapy.extensions.logstats] INFO: Crawled 9 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-14 15:16:08 [scrapy.extensions.logstats] INFO: Crawled 9 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

Everything works fine up until this point. Any help you could give me is appreciated!

I. K.
  • Did you notice your db after running the crawler? – ThunderMind Jan 14 '19 at 16:14
  • Everything that was crawled was put into the database correctly. After the crawling stops, the updates stop as well. – I. K. Jan 17 '19 at 08:36
  • Yes, of course you need to save changes to the db when the spider closes. `https://stackoverflow.com/questions/12394184/scrapy-call-a-function-when-a-spider-quits` Here you can find how to add a `spider_closed` function – ThunderMind Jan 17 '19 at 09:33
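
A minimal sketch of what that last comment points to, assuming the module-level `db` connection from the question: Scrapy calls a spider's closed() method when the spider finishes (a shortcut for the spider_closed signal), which is a convenient place to commit and close the connection.

from scrapy.spiders import CrawlSpider

class MySpider(CrawlSpider):
    name = "mySpider"

    def closed(self, reason):
        # called automatically when the spider closes (spider_closed signal);
        # `db` is the module-level MySQL connection from the question
        db.commit()  # flush any pending updates
        db.close()   # release the MySQL connection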

2 Answers


The reason Scrapy says 0 items is that it only counts yielded items, and you are not yielding anything, only inserting into your database.
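
A minimal sketch of what that could look like in your parse method (names taken from the question; the database update stays, the item is just yielded as well):

    def parse(self, response):
        global i
        item = myItem()
        item["result"] = response.xpath("//item to search")
        sql = "UPDATE db SET item = %s, scrape_time = now() WHERE id = %s"
        cursor.execute(sql, ('Item' if item["result"] else 'None', str(i)))
        db.commit()
        i += 1
        yield item  # this is what the "scraped N items" counter measures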

ThunderMind
  • That explains the data, which is not a problem, but why it stops crawling still escapes me - do you have any idea what could cause it? – I. K. Jan 17 '19 at 08:35

I just had this happen to me, so I wanted to share what caused the bug, in case someone encounters the exact same issue.

Apparently, if you don't specify a callback for a Request, it defaults to the spider's parse method as a callback (my intention was to not have a callback at all for those requests).

In my spider, I used the parse method to make most of the Requests, so this behavior caused many unnecessary requests that eventually led to Scrapy crashing. Simply adding an empty callback function (lambda a: None) for those requests solved my issue.
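
For illustration, a minimal sketch of that workaround (the helper name is just for this example):

from scrapy import Request

def request_without_parsing(url):
    # the explicit no-op callback keeps Scrapy from falling back to the
    # spider's parse() method for this request
    return Request(url, callback=lambda a: None)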

Royar