
I was wondering if there is a way to restart a scrapy crawler. This is what my code looks like:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess

results = set()

class SitemapCrawler(CrawlSpider):

    name = "Crawler"
    start_urls = ['http://www.example.com']
    allowed_domains = ['www.example.com']
    rules = [Rule(LinkExtractor(), callback='parse_links', follow=True)]

    def parse_links(self, response):
        href = response.xpath('//a/@href').getall()
        results.add(response.url)
        for link in href:
            results.add(link)

process = CrawlerProcess()

def start():
    process.crawl(SitemapCrawler)
    process.start()
    for link in results:
        print(link)

If I try calling start() twice, it runs once and then gives me this error:

raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

I know this is a general question, so I don't expect any code; I just want to know how I can fix this issue. Thanks in advance.

MeZo
  • What do you mean by restarting? Do you want to run two instances of the crawler simultaneously, or restart the crawler after it has somehow stopped? – ibilgen Dec 30 '20 at 23:07
  • @ibilgen, I mean run the crawler the first time until it finishes and then run it again a second time. – MeZo Dec 31 '20 at 12:32
  • I recommend starting the crawler from an independent script; then you can start it as often as you wish. – ibilgen Dec 31 '20 at 12:50
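As ibilgen suggests above, one way around ReactorNotRestartable is to keep the crawl in a standalone script and start a fresh Python process for each run, so every run gets its own Twisted reactor. A rough sketch of that idea (the file names crawl_script.py and main.py, and the import path of SitemapCrawler, are assumptions for illustration, not part of the original thread):

# crawl_script.py -- runs one full crawl and then exits
from scrapy.crawler import CrawlerProcess
from sitemap_spider import SitemapCrawler  # hypothetical module holding the spider above

process = CrawlerProcess()
process.crawl(SitemapCrawler)
process.start()  # blocks until the crawl finishes; the process then exits

# main.py -- launch the crawl as often as needed, each time in a new process
import subprocess
import sys

def start():
    # A new process means a new reactor, so calling this repeatedly
    # never raises ReactorNotRestartable.
    subprocess.run([sys.executable, "crawl_script.py"], check=True)

start()
start()  # the second run works because the first reactor died with its process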

1 Answer

from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    # Spider definition goes here
    ...

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()

d = runner.crawl(MySpider)

def finished(result):
    # the Deferred passes the crawl result into the callback
    print("finished :D")

d.addCallback(finished)
reactor.run()  # blocks here; the callback fires once the crawl finishes
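For the original question of running the crawl again after it finishes, a minimal sketch built on the same CrawlerRunner approach (not taken verbatim from this answer; the spider shown is a placeholder) chains the crawls and only stops the reactor at the end:

from twisted.internet import reactor, defer
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ["http://www.example.com"]

    def parse(self, response):
        self.logger.info("crawled %s", response.url)

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl_twice():
    # Each yield resolves when one full crawl has finished,
    # so the second run only starts after the first one completes.
    yield runner.crawl(MySpider)
    yield runner.crawl(MySpider)
    reactor.stop()

crawl_twice()
reactor.run()  # blocks here until reactor.stop() is called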
mtabbasi
  • Is there possibly a way to make the crawler stay on after it finishes crawling? What I want to do is: crawl a URL, then wait until a different URL is added, then crawl that one, and so on. I don't want to directly restart it. I'm using it for an API. – MeZo Jan 03 '21 at 15:17
  • You can't do that with just one request. You can check for URL changes by sending a request at a regular interval. For that, change the callback, e.g. `d.addCallback(sleep, seconds=) # call back in seconds` (see the sketch after these comments). – mtabbasi Jan 05 '21 at 15:23
  • @mtabbasi I found a question similar to mine: https://stackoverflow.com/questions/65522335/why-does-scrapy-crawler-only-work-once-in-flask-app. If you can answer that question, that would be great. – MeZo Jan 05 '21 at 17:07
  • @MeZo Checkout my answer – mtabbasi Jan 06 '21 at 14:31
  • Thanks for the solution, it works. Is there any possible way that I can run code after reactor.run()? – MeZo Jan 06 '21 at 17:51
  • @mtabbasi Can you answer this question: https://stackoverflow.com/questions/65605769/is-there-a-way-to-run-code-after-reactor-run-in-scrapy? – MeZo Jan 07 '21 at 03:23
  • @MeZo Check it out – mtabbasi Jan 07 '21 at 09:20
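
A rough sketch of the "check at a regular interval" idea from the comments above, using the same CrawlerRunner. The pending_urls list, the 5-second interval, and the url spider argument are illustrative assumptions, not anything from the thread (in the API scenario, a request handler would append new URLs to the queue):

from twisted.internet import reactor, defer, task
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    name = "myspider"

    def __init__(self, url=None, **kwargs):
        super().__init__(**kwargs)
        self.start_urls = [url] if url else []

    def parse(self, response):
        self.logger.info("crawled %s", response.url)

pending_urls = []  # hypothetical queue; an API handler would append new URLs here
configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()

@defer.inlineCallbacks
def poll_and_crawl():
    # Keep the reactor running forever: every 5 seconds, crawl whatever
    # URLs were queued since the last check, one crawl after another.
    while True:
        while pending_urls:
            yield runner.crawl(MySpider, url=pending_urls.pop(0))
        yield task.deferLater(reactor, 5, lambda: None)  # non-blocking sleep

poll_and_crawl()
reactor.run()  # blocks here; crawls are scheduled by the loop above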