I am scraping a large list of urls (1000-ish) and after a set time the crawler gets stuck with crawling 0 pages/min. The problem always occurs at the same spot when crawling. The list of urls is retrieved from a MySQL database. I am fairly new to python and scrapy so I don't know where to start debugging, and I fear that due to my inexperience the code itself is also a bit of a mess. Any pointers to where the issue lies are appreciated.
I used to retrieve the entire list of urls in one go, and the crawler worked fine. However I had problems with writing the results back into the database and I didn't want to read the whole large list of urls into the memory, so I changed it to iterate through the database one url at a time, where the problem occurred. I am fairly certain the url itself isn't the issue, because when I try to start the crawling from the problem url, it works without issue, getting stuck further down the line in a different, but consistent spot.
The relevant parts of the code are as follow. Note that the script is supposed to be run as a standalone script, which is why I define the necessary settings in the spider itself.
class MySpider(CrawlSpider):
name = "mySpider"
item = []
#spider settings
custom_settings = {
'CONCURRENT_REQUESTS': 1,
'DEPTH_LIMIT': 1,
'DNS_TIMEOUT': 5,
'DOWNLOAD_TIMEOUT':5,
'RETRY_ENABLED': False,
'REDIRECT_MAX_TIMES': 1
}
def start_requests(self):
while i < n_urls:
urllist = "SELECT url FROM database WHERE id=" + i
cursor = db.cursor()
cursor.execute(urllist)
urls = cursor.fetchall()
urls = [i[0] for i in urls] #fetch url from inside list of tuples
urls = str(urls[0]) #transform url into string from list
yield Request(urls, callback=self.parse, errback=self.errback)
def errback(self, failure):
global i
sql = "UPDATE db SET item = %s, scrape_time = now() WHERE id = %s"
val = ('Error', str(j))
cursor.execute(sql, val)
db.commit()
i += 1
def parse(self, response):
global i
item = myItem()
item["result"] = response.xpath("//item to search")
if item["result"] is None or len(item["result"]) == 0:
sql = "UPDATE db SET, item = %s, scrape_time = now() WHERE id = %s"
val = ('None', str(i))
cursor.execute(sql, val)
db.commit()
i += 1
else:
sql = "UPDATE db SET item = %s, scrape_time = now() WHERE id = %s"
val = ('Item', str(i))
cursor.execute(sql, val)
db.commit()
i += 1
The scraper gets stuck showing the following message:
2019-01-14 15:10:43 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET someUrl> from <GET anotherUrl>
2019-01-14 15:11:08 [scrapy.extensions.logstats] INFO: Crawled 9 pages (at 9 pages/min), scraped 0 items (at 0 items/min)
2019-01-14 15:12:08 [scrapy.extensions.logstats] INFO: Crawled 9 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-14 15:13:08 [scrapy.extensions.logstats] INFO: Crawled 9 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-14 15:14:08 [scrapy.extensions.logstats] INFO: Crawled 9 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-14 15:15:08 [scrapy.extensions.logstats] INFO: Crawled 9 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-14 15:16:08 [scrapy.extensions.logstats] INFO: Crawled 9 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
Everything works fine up until this point. Any help you could give me is appreciated!