0

I currently have a Scrapy crawler that runs once. I've been searching for a solution to have it continuously repeat its crawling cycle until it's stopped.

In other words, once the first iteration of the crawl completes, automatically start a second iteration without stopping the entire crawler, then a third iteration after that, and so on. Perhaps it could also run again after x seconds, although I'm unsure how the system would react if the previous crawl hasn't finished while it tries to launch another iteration.

Solutions I've found online so far only refer to cron or scrapyd, which I'm not interested in. I'm more interested in implementing a custom scheduler within the crawler project, using something like CrawlerRunner or the Twisted reactor. Does anyone have a couple of pointers?

The following code, from another Stack Overflow question, is the closest I've found to what I'm asking about, but I'm looking for advice on how to implement a more continuous approach.

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

def run_crawl():
    """
    Run a spider within Twisted. Once it completes,
    wait 5 seconds and run another spider.
    """
    runner = CrawlerRunner(get_project_settings())
    runner.crawl(SpiderA)  # SpiderA and SpiderB are spiders defined in the project
    runner.crawl(SpiderB)
    deferred = runner.join()
    # once both crawls finish, wait 5 seconds and start the next iteration
    deferred.addCallback(lambda _: reactor.callLater(5, run_crawl))
    return deferred

run_crawl()
reactor.run()

Error: "message": "Module 'twisted.internet.reactor' has no 'run' member", "source": "pylint",

UPDATE: How to schedule Scrapy crawl execution programmatically

Tried to implement this, but I'm unable to import my spider; I get a "module not found" error. Also, the reactor references are flagged in red with errors saying Module 'twisted.internet.reactor' has no 'callLater' member or has no 'run' member.

buklaou
  • 25
  • 1
  • 9
  • You failed to link to the question you copied the code from, but it is **very suspicious** that your `run_crawl` returns a `deferred` that gets thrown away – mdaniel Jan 21 '19 at 08:25
  • https://stackoverflow.com/questions/44228851/scrapy-on-a-schedule is the link – buklaou Jan 21 '19 at 08:48
  • Regarding the deferred, it is not thrown away, because the reactor already has it queued by the time the function ends. Hence, the return statement is what should not be there in the first place, I think. – Gallaecio Jan 21 '19 at 13:05
  • if you want to run on a schedule, you need `cron` – tim Jan 21 '19 at 21:55
  • I'm more interested in having it restart its original URL requests once it's completed, like I stated in the original post. So basically: start requests -> parse -> start requests again -> parse, without stopping the spider until I manually do it. Any thoughts? – buklaou Jan 22 '19 at 06:55

2 Answers

0

Unless you elaborate on what you mean by “more continuous”, the only way I can think of to make the code of the quoted response more continuous is to replace the 5 with 0 in the deferred.
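
For illustration, a minimal, self-contained sketch of that variant; ExampleSpider is a hypothetical stand-in for your own spider(s), and the rest follows the structure of the snippet quoted in the question:

from twisted.internet import reactor
from scrapy import Spider
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

class ExampleSpider(Spider):
    # hypothetical placeholder; use your own spider class here
    name = "example"
    start_urls = ["https://example.com"]

    def parse(self, response):
        yield {"title": response.css("title::text").get()}

def run_crawl():
    runner = CrawlerRunner(get_project_settings())
    runner.crawl(ExampleSpider)
    deferred = runner.join()
    # 0 instead of 5: queue the next iteration as soon as the current one finishes
    deferred.addCallback(lambda _: reactor.callLater(0, run_crawl))
    return deferred

run_crawl()
reactor.run()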

Gallaecio
  • 3,620
  • 2
  • 25
  • 64
0

Use apscheduler

# -*- coding: utf-8 -*-
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from apscheduler.schedulers.twisted import TwistedScheduler

from Demo.spiders.google import GoogleSpider # your spider

process = CrawlerProcess(get_project_settings())
scheduler = TwistedScheduler()
# queue a new crawl of the spider every 10 seconds on the Twisted reactor
scheduler.add_job(process.crawl, 'interval', args=[GoogleSpider], seconds=10)
scheduler.start()
# False -> stop_after_crawl=False, so the reactor keeps running between crawls
process.start(False)
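
Note that the False passed to process.start maps to stop_after_crawl=False, so the Twisted reactor stays alive after each crawl finishes and the TwistedScheduler can keep starting a new crawl on the 10-second interval.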

samuel161
  • 221
  • 3
  • 2
  • 2
    When answering an old question, or any question in general, it would be great if you could provide some context to your answer rather than mostly code. – David Buck Nov 24 '19 at 10:50