
I want to create a scheduler script to run the same spider multiple times in a sequence.

So far I got the following:

#!/usr/bin/python3
"""Scheduler for spiders."""
import time

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from my_project.spiders.deals import DealsSpider


def crawl_job():
    """Job to start spiders."""
    settings = get_project_settings()
    process = CrawlerProcess(settings)
    process.crawl(DealsSpider)
    process.start() # the script will block here until the end of the crawl


if __name__ == '__main__':

    while True:
        crawl_job()
        time.sleep(30) # wait 30 seconds then crawl again

The first time, the spider executes properly; then, after the time delay, the spider starts up again, but right before it would start scraping I get the following error message:

Traceback (most recent call last):
  File "scheduler.py", line 27, in <module>
    crawl_job()
  File "scheduler.py", line 17, in crawl_job
    process.start() # the script will block here until the end of the crawl
  File "/usr/local/lib/python3.5/dist-packages/scrapy/crawler.py", line 285, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/base.py", line 1193, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/base.py", line 1173, in startRunning
    ReactorBase.startRunning(self)
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

Unfortunately I'm not familiar with the Twisted framework and its Reactors, so any help would be appreciated!

Szabolcs

1 Answer


You're getting the ReactorNotRestartable error because Twisted's reactor cannot be restarted once it has been stopped. Each time process.start() is called, it tries to start the reactor again. There's plenty of information about this around the web. Here's a simple solution:

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

from my_project.spiders.deals import DealsSpider


def crawl_job():
    """
    Job to start spiders.
    Return Deferred, which will execute after crawl has completed.
    """
    settings = get_project_settings()
    runner = CrawlerRunner(settings)
    return runner.crawl(DealsSpider)

def schedule_next_crawl(null, sleep_time):
    """
    Schedule the next crawl.
    `null` is the (ignored) result of the previous crawl's Deferred.
    """
    reactor.callLater(sleep_time, crawl)

def crawl():
    """
    A "recursive" function that schedules a crawl 30 seconds after
    each successful crawl.
    """
    # crawl_job() returns a Deferred
    d = crawl_job()
    # call schedule_next_crawl(<scrapy response>, n) after crawl job is complete
    d.addCallback(schedule_next_crawl, 30)
    d.addErrback(catch_error)

def catch_error(failure):
    print(failure.value)

if __name__=="__main__":
    crawl()
    reactor.run()

There are a few noticeable differences from your snippet: the reactor is invoked directly, CrawlerRunner replaces CrawlerProcess, time.sleep has been removed so that the reactor doesn't block, and the while loop has been replaced by repeated calls to the crawl function via reactor.callLater. It's short and should do what you want. If any parts confuse you, let me know and I'll elaborate.
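The self-rescheduling idea above (a job whose completion callback schedules the next run) is not specific to Twisted. As a minimal illustration of the same pattern using only the standard library, here is a sketch with the sched module, bounded to three runs so it terminates; the names make_repeater and the dummy job are made up for this example and are not part of the answer's code:

```python
import sched
import time


def make_repeater(scheduler, job, interval, max_runs):
    """Run `job` every `interval` seconds, up to `max_runs` times.

    Like the callLater pattern above, each run schedules the next one
    instead of blocking in a sleep loop.
    """
    runs = []

    def run_once():
        runs.append(job())
        if len(runs) < max_runs:
            # reschedule ourselves, mirroring reactor.callLater(...)
            scheduler.enter(interval, 1, run_once)

    scheduler.enter(0, 1, run_once)
    return runs


s = sched.scheduler(time.monotonic, time.sleep)
results = make_repeater(s, lambda: "crawled", 0.01, 3)
s.run()  # blocks until all scheduled runs have completed
print(results)  # ['crawled', 'crawled', 'crawled']
```

In the Twisted version the reactor plays the role of the scheduler, and the Deferred's callback plays the role of run_once rescheduling itself.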

UPDATE - Crawl at a specific time

import datetime as dt

def schedule_next_crawl(null, hour, minute):
    tomorrow = (
        dt.datetime.now() + dt.timedelta(days=1)
        ).replace(hour=hour, minute=minute, second=0, microsecond=0)
    sleep_time = (tomorrow - dt.datetime.now()).total_seconds()
    reactor.callLater(sleep_time, crawl)

def crawl():
    d = crawl_job()
    # crawl every day at 13:30
    d.addCallback(schedule_next_crawl, hour=13, minute=30)
    d.addErrback(catch_error)
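One caveat with the snippet above: it always schedules for tomorrow, so the very first crawl can wait up to 24 hours even when 13:30 hasn't passed yet today. A small stdlib-only sketch of the delay computation that picks today when possible (seconds_until is a hypothetical helper name, not from the answer's code):

```python
import datetime as dt


def seconds_until(hour, minute, now=None):
    """Seconds until the next occurrence of hour:minute.

    Uses today's occurrence if it is still in the future,
    otherwise rolls over to tomorrow.
    """
    now = now or dt.datetime.now()
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        target += dt.timedelta(days=1)
    return (target - now).total_seconds()


# deterministic checks with a fixed "now" of 12:00
noon = dt.datetime(2017, 12, 1, 12, 0)
print(seconds_until(13, 30, now=noon))  # 5400.0  -> today at 13:30
print(seconds_until(9, 0, now=noon))    # 75600.0 -> tomorrow at 09:00
```

The result would be passed to reactor.callLater(sleep_time, crawl) exactly as in schedule_next_crawl above.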
notorious.no
  • I ran the script like `python3 scheduler.py`, but it keeps idling and does nothing. What could be the problem? – Szabolcs Dec 01 '17 at 12:09
  • It's difficult to know what the problem is without diving into your code. Put `print` statements or break points in the functions and see where it idles. – notorious.no Dec 01 '17 at 17:12
  • I've rerun the script from home, and now it raises an exception like this: `Unhandled error in Deferred: Traceback (most recent call last): File "scheduler.py", line 20, in crawl_job return runner.crawl(DealsSpider)` – Szabolcs Dec 01 '17 at 17:29
  • I've updated the example to catch errors during a crawl. There's an exception in your spider, so debug `DealsSpider`. – notorious.no Dec 01 '17 at 17:46
  • Oh yeah, now I see. Sorry it was my fault. With `CrawlerRunner` you have to be more explicit and have to define logging settings, otherwise it seems like it's idling and nothing happens. So yeah your answer is working. Other than that can you suggest some workaround to schedule the execution for a specific time of day, like `reactor.callAt('09:00', crawl)`? – Szabolcs Dec 01 '17 at 18:43
  • Added an example of how to scrape at a specific time. In the example, it will crawl at 1:30 (13:30) of the following day. However, consider `cron` for scheduling tasks. – notorious.no Dec 01 '17 at 22:17
  • Thanks for the suggestions and the update! At the moment I use `cron`, btw. I wanted to use `CrawlerRunner` with the `schedule` package, but it seems like I have to fall back to a self-made scheduler to avoid conflicts with `Twisted`. – Szabolcs Dec 02 '17 at 09:17