
Getting Scrapy to run on a schedule is driving me around the Twist(ed).

I thought the test code below would work, but I get a twisted.internet.error.ReactorNotRestartable error when the spider is triggered a second time:

from quotesbot.spiders.quotes import QuotesSpider
import schedule
import time
from scrapy.crawler import CrawlerProcess

def run_spider_script():
    process.crawl(QuotesSpider)
    process.start()


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
})


schedule.every(5).seconds.do(run_spider_script)

while True:
    schedule.run_pending()
    time.sleep(1)

My guess is that, as part of CrawlerProcess, the Twisted reactor is told to start again when that isn't allowed, and so the program crashes. Is there any way I can control this?

Also, at this stage, if there's an alternative way to automate a Scrapy spider to run on a schedule, I'm all ears. I tried scrapy.cmdline.execute, but couldn't get that to loop either:

from quotesbot.spiders.quotes import QuotesSpider
from scrapy import cmdline
import schedule
import time
from scrapy.crawler import CrawlerProcess


def run_spider_cmd():
    print("Running spider")
    cmdline.execute("scrapy crawl quotes".split())


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
})


schedule.every(5).seconds.do(run_spider_cmd)

while True:
    schedule.run_pending()
    time.sleep(1)

EDIT

Adding code that uses Twisted's task.LoopingCall() to run a test spider every few seconds. Am I going about this completely the wrong way to schedule a spider that runs at the same time each day?

from twisted.internet import reactor
from twisted.internet import task
from scrapy.crawler import CrawlerRunner
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):

        quotes = response.xpath('//div[@class="quote"]')

        for quote in quotes:

            author = quote.xpath('.//small[@class="author"]/text()').extract_first()
            text = quote.xpath('.//span[@class="text"]/text()').extract_first()

            print(author, text)


def run_crawl():

    runner = CrawlerRunner()
    runner.crawl(QuotesSpider)


l = task.LoopingCall(run_crawl)
l.start(3)

reactor.run()
itzafugazi
  • Why not simply use cron or systemd timers? – Granitosaurus May 28 '17 at 15:40
  • The web-scraping of data is only one part of the intended application, and I am hoping to have everything run as part of a single program. But yes, if I can't get this working as described, I will use an OS task scheduler to run the Scrapy script, with the rest of application running separately. – itzafugazi May 28 '17 at 16:15

2 Answers

First noteworthy point: there's usually only one Twisted reactor running, and it's not restartable (as you've discovered). Second, blocking tasks/functions should be avoided (e.g. time.sleep(n)) and replaced with async alternatives (e.g. task.deferLater(reactor, n, ...)).
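For instance, here's a minimal sketch (my illustration, not from the original code) of a non-blocking delay; say_hello is just a placeholder function:

from twisted.internet import reactor, task

def say_hello():
    print("5 seconds passed without blocking the reactor")

# task.deferLater returns a Deferred that fires after the delay
d = task.deferLater(reactor, 5, say_hello)
d.addCallback(lambda _: reactor.stop())  # shut down once the callback has run
reactor.run()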

Using Scrapy effectively from a Twisted project requires the scrapy.crawler.CrawlerRunner core API rather than scrapy.crawler.CrawlerProcess. The main difference between the two is that CrawlerProcess runs Twisted's reactor for you (which makes restarting the reactor an issue), whereas CrawlerRunner leaves starting the reactor to the developer. Here's what your code could look like with CrawlerRunner:

from twisted.internet import reactor
from quotesbot.spiders.quotes import QuotesSpider
from scrapy.crawler import CrawlerRunner

def run_crawl():
    """
    Run a spider within Twisted. Once it completes,
    wait 5 seconds and run another spider.
    """
    runner = CrawlerRunner({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        })
    deferred = runner.crawl(QuotesSpider)
    # you can use reactor.callLater or task.deferLater to schedule a function;
    # wrap the call in a lambda so the crawl result isn't passed as the delay
    deferred.addCallback(lambda _: reactor.callLater(5, run_crawl))
    return deferred

run_crawl()
reactor.run()   # you have to run the reactor yourself
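If the goal is a crawl at the same time every day (as discussed in the comments below), one option is to replace the fixed 5 with the number of seconds until the next run. A sketch, where seconds_until is a helper I'm introducing for illustration, not part of Scrapy or Twisted:

from datetime import datetime, timedelta

def seconds_until(hour, minute=0):
    """Seconds from now until the next occurrence of hour:minute."""
    now = datetime.now()
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        target += timedelta(days=1)  # today's slot already passed; run tomorrow
    return (target - now).total_seconds()

# e.g. schedule the next crawl for 02:30 each day:
# deferred.addCallback(lambda _: reactor.callLater(seconds_until(2, 30), run_crawl))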
notorious.no
  • Thanks @notorious.no, this has begun clearing things up for me, but unfortunately I couldn't get this working on a schedule. I'm probably missing something obvious, but I don't see how I would implement this to run a spider at a specific time each day. The closest I can get is using Twisted `task.LoopingCall()`, which I could use to run a spider every 86400 seconds for a daily scrape, but am I going about this the wrong way? I've updated my post with the code for the loop, would really appreciate your guidance! – itzafugazi May 29 '17 at 21:27
  • 2
    LoopingCall will work fine and is the simplest solution. You could also modify the example code (ie. `addCallback(reactor.callLater, 5, run_crawl)`) and replace `5` with the number of seconds that represents when you want to scrape next. This will give you a bit more precision as opposed to `LoopingCall` – notorious.no Jun 02 '17 at 14:46
  • Thanks @notorious.no. I misunderstood what was happening with the `deferred.addCallback`; a bit of timestamping in debug and it's starting to make sense. This is finally going to work for me, thanks a lot for your help! – itzafugazi Jun 03 '17 at 23:35
  • @notorious.no I have a problem. Could you please help me? https://stackoverflow.com/questions/74072040/try-except-and-schedule-for-scrapy Thanks –  Oct 15 '22 at 10:53

You can use APScheduler:

pip install apscheduler
# -*- coding: utf-8 -*-
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from apscheduler.schedulers.twisted import TwistedScheduler

from Demo.spiders.baidu import YourSpider

process = CrawlerProcess(get_project_settings())
scheduler = TwistedScheduler()  # schedules jobs on the Twisted reactor that Scrapy uses
scheduler.add_job(process.crawl, 'interval', args=[YourSpider], seconds=10)
scheduler.start()
process.start(False)  # stop_after_crawl=False keeps the reactor running between jobs
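If you want a run at the same time each day rather than every 10 seconds, APScheduler's cron trigger should work as a drop-in replacement (a sketch; the 02:30 time is just an example):

# 'cron' trigger instead of 'interval': run every day at 02:30
scheduler.add_job(process.crawl, 'cron', args=[YourSpider], hour=2, minute=30)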
samuel161
  • This will not work with Django: the spider will open but not scrape or block the server's initialization. – Sy Ker Jan 14 '22 at 12:06