
I'm trying to run a Scrapy spider from script using Celery periodic task.

Twisted==17.9.0
Scrapy==1.4.0
celery==4.1.0

I have a class SpiderSupervisor which gets the data needed to run a spider and decides whether the spider should run at the moment.

The problem is that if I use the standard way:

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start() 

It works the first time, but then it raises ReactorNotRestartable.
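From what I understand, the Twisted reactor can only be started once per process, so one workaround is to launch each crawl in a fresh child process. Here is a minimal stdlib sketch of the idea, using a stand-in class in place of the real reactor (the real one comes from Twisted, and inside a Celery worker `billiard.Process` would replace `multiprocessing.Process`, since Celery's daemonised workers cannot fork `multiprocessing` children):

```python
import multiprocessing


class OneShotReactor:
    """Stand-in for Twisted's reactor: start() works only once per process."""
    _started = False

    @classmethod
    def start(cls):
        if cls._started:
            raise RuntimeError("ReactorNotRestartable")
        cls._started = True


def _crawl(result_queue):
    # A fresh process has fresh class state, so start() always succeeds here.
    # In the real task this is where CrawlerProcess().start() would run.
    OneShotReactor.start()
    result_queue.put("crawl finished")


def crawl_in_subprocess():
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_crawl, args=(queue,))
    proc.start()
    proc.join()
    return queue.get()
```

Calling `crawl_in_subprocess()` repeatedly works, because every call gets a brand-new process, while calling `OneShotReactor.start()` twice in the same process raises, which is exactly the ReactorNotRestartable situation.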

So I tried another way using scrapyscript, but then the spider gets initialized twice.

This way doesn't work either: Run a Scrapy spider in a Celery Task

The crawler.configure(), reactor.run() and crawler.start() calls it relies on no longer match the current Scrapy and Twisted APIs:

from scrapy.crawler import Crawler
from twisted.internet import reactor
from billiard import Process # this can be from billiard.process import Process

My code:

tasks.py:

from datetime import timedelta
from celery.task import periodic_task  # celery 4.1

@periodic_task(run_every=timedelta(minutes=1))
def ping_spider():
    SpiderSupervisor().send_signal()

SpiderSupervisor:

class SpiderSupervisor():
    """ - Decides whether run spider now
        - Sets last_hour_ping and hour in SystemScanningData
    """

    def __init__(self):  # TODO: exceptions?
        self.system_scanning_data = SystemScanningData.objects.first()

    ...

    def _get_new_system_scanning(self):
        system_scanning = SystemScanning.objects.create()
        return system_scanning

    def send_signal(self):
        self.system_scanning_data.update()
        users = self.get_users_to_scan()
        if users.exists():
            urls_queryset = Url.objects.filter(product__user__in=users)
            self.prepare_and_run_spider(urls_queryset)

    def prepare_and_run_spider(self, urls_queryset):
        system_scanning = self._get_new_system_scanning()
        # spider = StilioMainSpider([1,2,3])
        # job = Job(spider)
        # Processor().run(job)
        process = CrawlerProcess()
        process.crawl(StilioMainSpider,[1,2,3])
        process.start()

Do you know how to make this work? Is there another way? I need to pass arguments to the spider.
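On the argument-passing part: as far as I can tell, process.crawl() forwards its extra positional and keyword arguments to the spider's constructor, so the spider can pick them up in __init__. A sketch of the spider side (shown as a plain class here; the real spider subclasses scrapy.Spider, and the user_ids parameter name is just an example):

```python
# Sketch: scrapy passes crawl()'s extra arguments to the spider's __init__,
# so StilioMainSpider can receive the [1, 2, 3] list from the question.
class StilioMainSpider:
    name = "stilio_main"  # hypothetical spider name

    def __init__(self, user_ids=None, *args, **kwargs):
        # super().__init__(*args, **kwargs)  # needed when subclassing scrapy.Spider
        self.user_ids = user_ids or []
        # start_urls could then be built from self.user_ids
```

With this, `process.crawl(StilioMainSpider, [1, 2, 3])` ends up constructing the spider with `user_ids == [1, 2, 3]`.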

Milano
  • Don't use this approach, because the Twisted reactor, once stopped, cannot be restarted. The best option is to run a `scrapyd` daemon and call its API to schedule the scraper – Tarun Lalwani Sep 27 '17 at 09:43
  • Can I use the SpiderSupervisor class with scrapyd? I mean, can a SpiderSupervisor instance start the spider and send it parameters? – Milano Sep 27 '17 at 09:48
  • Yes, you can do it that way. You need to use the `scrapyd-client` package. – Tarun Lalwani Sep 27 '17 at 11:50
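Following the scrapyd suggestion, SpiderSupervisor.prepare_and_run_spider could POST to scrapyd's schedule.json endpoint instead of starting a reactor in the Celery worker; scrapyd passes any extra POST parameters through to the spider as arguments (as strings). A hedged sketch, where the project name, spider name, and user_ids parameter are assumptions, and the HTTP call needs a running scrapyd with the project deployed (e.g. via scrapyd-client's `scrapyd-deploy`):

```python
from urllib.parse import urlencode
from urllib.request import urlopen

SCRAPYD_URL = "http://localhost:6800/schedule.json"  # scrapyd's default port


def build_schedule_payload(project, spider, **spider_args):
    # Besides the required "project" and "spider" fields, scrapyd forwards
    # every extra POST parameter to the spider as an argument.
    payload = {"project": project, "spider": spider}
    payload.update(spider_args)
    return payload


def schedule_spider(project, spider, **spider_args):
    data = urlencode(build_schedule_payload(project, spider, **spider_args)).encode()
    # Network call: requires a running scrapyd daemon with the project deployed.
    with urlopen(SCRAPYD_URL, data=data) as response:
        return response.read()


# e.g. schedule_spider("stilio", "stilio_main", user_ids="1,2,3")
```

Note that spider arguments arrive as strings, so a list like [1, 2, 3] would need to be serialized (e.g. "1,2,3") and parsed again in the spider's __init__.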

0 Answers