I'm trying to run a Scrapy spider from a script using a Celery periodic task.
```
Twisted==17.9.0
Scrapy==1.4.0
celery==4.1.0
```
I have a class SpiderSupervisor which gets the data needed to run a spider and decides whether to run it at the moment. The problem is that if I use the standard way:
```python
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(MySpider)
process.start()
```
it works the first time, but then it raises ReactorNotRestartable.
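For context: Twisted's reactor can only be started once per process, so the second periodic task run inevitably fails. A common workaround (not from the original post) is to launch each crawl in a short-lived child process, so every run gets a fresh reactor. A minimal sketch of the pattern, with the Scrapy calls replaced by a placeholder so it runs standalone:

```python
from multiprocessing import Process, Queue


def run_crawl(queue, urls):
    """Runs inside a fresh child process, so the Twisted reactor
    (started in real code via CrawlerProcess.start()) is new each time.
    The Scrapy calls are stubbed out here; a hypothetical real body:
        process = CrawlerProcess()
        process.crawl(MySpider, urls)
        process.start()
    """
    queue.put(f"crawled {len(urls)} urls")


def crawl_in_subprocess(urls):
    # Each call spawns, joins, and discards one child process.
    queue = Queue()
    p = Process(target=run_crawl, args=(queue, urls))
    p.start()
    p.join()
    return queue.get()


if __name__ == "__main__":
    print(crawl_in_subprocess([1, 2, 3]))
```

The same idea works from a Celery task, except that Celery worker children are daemonic, which is why billiard's Process (already imported below) is typically used there instead of multiprocessing.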
So I tried another way using scrapyscript, but the spider gets initialized twice. This approach doesn't work either: Run a Scrapy spider in a Celery Task. There are no crawler.configure(), reactor.run(), or crawler.start() in these versions of scrapy and twisted:
```python
from scrapy.crawler import Crawler
from twisted.internet import reactor
from billiard import Process  # this can be from billiard.process import Process
```
tasks.py:

```python
@periodic_task(run_every=timedelta(minutes=1))
def ping_spider():
    SpiderSupervisor().send_signal()
```
SpiderSupervisor:

```python
class SpiderSupervisor():
    """ - Decides whether to run the spider now
        - Sets last_hour_ping and hour in SystemScanningData
    """
    def __init__(self):  # TODO: exceptions?
        self.system_scanning_data = SystemScanningData.objects.first()
        ...

    def _get_new_system_scanning(self):
        system_scanning = SystemScanning.objects.create()
        return system_scanning

    def send_signal(self):
        self.system_scanning_data.update()
        users = self.get_users_to_scan()
        if users.exists():
            urls_queryset = Url.objects.filter(product__user__in=users)
            self.prepare_and_run_spider(urls_queryset)

    def prepare_and_run_spider(self, urls_queryset):
        system_scanning = self._get_new_system_scanning()
        # spider = StilioMainSpider([1,2,3])
        # job = Job(spider)
        # Processor().run(job)
        process = CrawlerProcess()
        process.crawl(StilioMainSpider, [1, 2, 3])
        process.start()
```
Do you know how to make this work? Is there another way? I need to pass arguments to the spider.