
I have a program that periodically scrapes some domains and checks for something in the results. This means I need to run a spider, collect the results, and run it again after an indefinite amount of time. The problem is that once I run the spider with the code below, I can't run it again, since the Twisted reactor can't be restarted.

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(some_settings)
process.crawl(myspider)
process.start()  # blocks until the crawl finishes; the reactor can't be started a second time

So, what are my options in running spiders in this way?
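One workaround (a sketch, not from the question itself) is to launch each crawl in a fresh Python process, so every run gets its own brand-new Twisted reactor. The `-c` payload below is only a stand-in for the real crawl; in practice it would be `scrapy crawl myspider` or a small script that builds the `CrawlerProcess` shown above:

```python
import subprocess
import sys

def crawl_once():
    # Run the crawl in a separate interpreter. When the child exits,
    # its reactor dies with it, so the "reactor not restartable"
    # restriction never applies across runs in the parent.
    result = subprocess.run(
        [sys.executable, "-c", "print('crawl finished')"],  # stand-in for the real spider run
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

for _ in range(2):          # repeat as often as needed (timer, manual trigger, ...)
    print(crawl_once())     # each iteration gets a clean reactor
```

The parent process can then sleep, wait on a schedule, or react to a manual trigger between calls to `crawl_once()`.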

C.Acarbay
  • Afaik scrapy is single threaded and has an async control flow, so sleeping the main thread might work while you are looping the crawling processes. – gunesevitan Aug 10 '19 at 18:21
  • Do you want to repeat this process periodically (like every day or every hour), or directly when it finishes? – hadesfv Aug 10 '19 at 18:35
  • @hadesfv The process will be repeated periodically, but it might also be triggered manually. – C.Acarbay Aug 12 '19 at 16:49
  • @gunesevitan Could you open that up a little more for a scrapy noob please? – C.Acarbay Aug 12 '19 at 16:50
  • Can you use a cronjob to trigger it periodically? (I do it like this in a bit more advanced way, because I have many spiders, but in general for me it comes down to cronjobs to run my spiders automatically from time to time). – aufziehvogel Aug 12 '19 at 21:19
  • If you really need to restart it in the same process have a look at the documentation for [Common Practices](https://docs.scrapy.org/en/latest/topics/practices.html): "It’s recommended you use CrawlerRunner instead of CrawlerProcess if your application is already using Twisted and you want to run Scrapy in the same reactor." – aufziehvogel Aug 12 '19 at 21:21
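The cron approach suggested in the comments could look like the following crontab fragment (the interpreter and script paths are hypothetical placeholders for wherever the spider-launching script lives):

```
# hypothetical crontab entry: run the spider script at the top of every hour
0 * * * * /usr/bin/python3 /path/to/run_myspider.py
```

Since each cron invocation is a fresh process, the reactor-restart problem never arises; the manual-trigger case can simply run the same script by hand.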

0 Answers