I currently have a Scrapy crawler that runs once. I'm looking for a way to make it repeat its crawl cycle continuously until it is stopped.
In other words, once the first iteration of the crawl completes, a second iteration should start automatically without stopping the whole crawler, then a third, and so on. Alternatively, it could re-run every x seconds, although I'm unsure how the system would react if a new iteration were launched while the previous crawl was still running (see the sketch below).
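For the "every x seconds" variant, my understanding is that Twisted's task.LoopingCall waits for a Deferred returned by its function before scheduling the next call, so iterations shouldn't be able to overlap. A minimal sketch of what I have in mind (MySpider and the import path are placeholders for my actual project):

from twisted.internet import reactor, task
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

from myproject.spiders.my_spider import MySpider  # hypothetical module/class names

configure_logging()
runner = CrawlerRunner(get_project_settings())

def crawl():
    # runner.crawl() returns a Deferred that fires when the crawl finishes.
    # Returning it makes LoopingCall hold off on the next call until then,
    # so a new iteration never starts while the previous one is still running.
    return runner.crawl(MySpider)

loop = task.LoopingCall(crawl)
loop.start(60)  # aim for one crawl every 60 seconds, but never concurrently
reactor.run()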
The solutions I've found online so far only involve cron or scrapyd, neither of which I'm interested in. I'd rather implement a custom scheduler within the crawler project itself, using something like CrawlerRunner or the Twisted reactor. Does anyone have a couple of pointers?
The following code, from another Stack Overflow question, is the closest thing I've found to what I'm after, but I'm looking for advice on how to make the approach more continuous.
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

def run_crawl():
    """
    Run a spider within Twisted. Once it completes,
    wait 5 seconds and run another spider.
    """
    runner = CrawlerRunner(get_project_settings())
    runner.crawl(SpiderA)  # SpiderA and SpiderB are placeholders for your spider classes
    runner.crawl(SpiderB)
    deferred = runner.join()  # fires once both crawls have finished
    # addCallback passes the Deferred's result as the first argument, so wrap
    # reactor.callLater in a lambda to keep its (delay, callable) signature intact.
    deferred.addCallback(lambda _: reactor.callLater(5, run_crawl))
    return deferred

configure_logging()
run_crawl()
reactor.run()
Error: "message": "Module 'twisted.internet.reactor' has no 'run' member", "source": "pylint",
UPDATE: How to schedule Scrapy crawl execution programmatically
I tried to implement this, but I'm unable to import my spider; I get a ModuleNotFoundError. The reactor lines are also flagged with errors saying Module 'twisted.internet.reactor' has no 'callLater' member, or has no 'run' member.
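On the import error, my guess is that the script isn't being run from the project root, so the project package never lands on sys.path. A sketch of the workaround I'm trying, assuming the script sits next to scrapy.cfg (myproject, my_spider, and MySpider are placeholders for my actual layout):

import sys
from pathlib import Path

# Make the Scrapy project package importable even when this script is
# launched from outside the project root.
sys.path.insert(0, str(Path(__file__).resolve().parent))

from myproject.spiders.my_spider import MySpider  # hypothetical module/class names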