
I have a program that periodically scrapes some domains and checks for something in the results. This means I need to run a spider, collect the results, and run it again after an indefinite amount of time. The problem is that once I run the spider with the code below, I can't run it again, since the Twisted reactor can't be restarted.

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(some_settings)
process.crawl(myspider)
process.start()  # blocks until the crawl finishes; the reactor can't be started a second time

So, what are my options in running spiders in this way?
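One workaround (a sketch, not from the question itself) is to launch each crawl in a fresh Python process, so every run gets its own brand-new Twisted reactor. The `-c` payload below is only a stand-in for the real crawl; in practice it would be `scrapy crawl myspider` or a small script that builds the `CrawlerProcess` shown above:

```python
import subprocess
import sys

def crawl_once():
    # Run the crawl in a separate interpreter. When the child exits,
    # its reactor dies with it, so the "reactor not restartable"
    # restriction never applies across runs in the parent.
    result = subprocess.run(
        [sys.executable, "-c", "print('crawl finished')"],  # stand-in for the real spider run
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

for _ in range(2):          # repeat as often as needed (timer, manual trigger, ...)
    print(crawl_once())     # each iteration gets a clean reactor
```

The parent process can then sleep, wait on a schedule, or react to a manual trigger between calls to `crawl_once()`.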

C.Acarbay
  • Afaik scrapy is single threaded and has an async control flow, so sleeping the main thread might work while you are looping the crawling processes. – gunesevitan Aug 10 '19 at 18:21
  • Do you want to repeat this process periodically (like every day or every hour), or directly when it finishes? – hadesfv Aug 10 '19 at 18:35
  • @hadesfv The process will be repeated periodically, but it might also be triggered manually. – C.Acarbay Aug 12 '19 at 16:49
  • @gunesevitan Could you open that up a little more for a scrapy noob please? – C.Acarbay Aug 12 '19 at 16:50
  • Can you use a cronjob to trigger it periodically? (I do it like this in a bit more advanced way, because I have many spiders, but in general for me it comes down to cronjobs to run my spiders automatically from time to time). – aufziehvogel Aug 12 '19 at 21:19
  • If you really need to restart it in the same process have a look at the documentation for [Common Practices](https://docs.scrapy.org/en/latest/topics/practices.html): "It’s recommended you use CrawlerRunner instead of CrawlerProcess if your application is already using Twisted and you want to run Scrapy in the same reactor." – aufziehvogel Aug 12 '19 at 21:21
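The cron approach suggested in the comments could look like the following crontab fragment (the interpreter and script paths are hypothetical placeholders for wherever the spider-launching script lives):

```
# hypothetical crontab entry: run the spider script at the top of every hour
0 * * * * /usr/bin/python3 /path/to/run_myspider.py
```

Since each cron invocation is a fresh process, the reactor-restart problem never arises; the manual-trigger case can simply run the same script by hand.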

0 Answers