
I was wondering if there is a way to restart a scrapy crawler. This is what my code looks like:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess

results = set()

class SitemapCrawler(CrawlSpider):

    name = "Crawler"
    start_urls = ['http://www.example.com']
    allowed_domains = ['www.example.com']
    rules = [Rule(LinkExtractor(), callback='parse_links', follow=True)]

    def parse_links(self, response):
        href = response.xpath('//a/@href').getall()
        results.add(response.url)
        for link in href:
            results.add(link)

process = CrawlerProcess()

def start():
    process.crawl(SitemapCrawler)
    process.start()
    for link in results:
        print(link)

If I try calling start() twice, it runs once and then gives me this error:

raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

I know this is a general question, so I don't expect any code; I just want to know how I can fix this issue. Thanks in advance.

MeZo
  • What do you mean by restarting? Do you want to run two instances of the crawler simultaneously, or restart the crawler after it has somehow stopped? – ibilgen Dec 30 '20 at 23:07
  • @ibilgen, I mean run the crawler the first time until it finishes and then run it again a second time. – MeZo Dec 31 '20 at 12:32
  • I recommend starting the crawler from an independent script; then you can start it as often as you wish. – ibilgen Dec 31 '20 at 12:50
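As ibilgen suggests above, one way around ReactorNotRestartable is to keep the crawl in a standalone script and start a fresh Python process for each run, so every run gets its own Twisted reactor. A rough sketch of that idea (the file names crawl_script.py and main.py, and the import path of SitemapCrawler, are assumptions for illustration, not part of the original thread):

# crawl_script.py -- runs one full crawl and then exits
from scrapy.crawler import CrawlerProcess
from sitemap_spider import SitemapCrawler  # hypothetical module holding the spider above

process = CrawlerProcess()
process.crawl(SitemapCrawler)
process.start()  # blocks until the crawl finishes; the process then exits

# main.py -- launch the crawl as often as needed, each time in a new process
import subprocess
import sys

def start():
    # A new process means a new reactor, so calling this repeatedly
    # never raises ReactorNotRestartable.
    subprocess.run([sys.executable, "crawl_script.py"], check=True)

start()
start()  # the second run works because the first reactor died with its process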

1 Answer

from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    # Spider definition goes here
    ...

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()

d = runner.crawl(MySpider)

def finished(result):
    # the Deferred passes the crawl result into the callback
    print("finished :D")

d.addCallback(finished)
reactor.run()  # blocks here; the callback fires once the crawl finishes
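For the original question of running the crawl again after it finishes, a minimal sketch built on the same CrawlerRunner approach (not taken verbatim from this answer; the spider shown is a placeholder) chains the crawls and only stops the reactor at the end:

from twisted.internet import reactor, defer
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ["http://www.example.com"]

    def parse(self, response):
        self.logger.info("crawled %s", response.url)

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl_twice():
    # Each yield resolves when one full crawl has finished,
    # so the second run only starts after the first one completes.
    yield runner.crawl(MySpider)
    yield runner.crawl(MySpider)
    reactor.stop()

crawl_twice()
reactor.run()  # blocks here until reactor.stop() is called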
mtabbasi
  • Is there possibly a way to make the crawler stay on after it finishes crawling? What I want to do is: crawl a URL, then wait until a different URL is added, then crawl that one, and so on. I don't want to directly restart it. I'm using it for an API. – MeZo Jan 03 '21 at 15:17
  • You can't do that with just one request. You can check for URL changes by sending a request at a regular interval. For that, change the callback, e.g. `d.addCallback(sleep, seconds=) # call back in seconds` (see the sketch after these comments). – mtabbasi Jan 05 '21 at 15:23
  • @mtabbasi I found a question similar to mine: https://stackoverflow.com/questions/65522335/why-does-scrapy-crawler-only-work-once-in-flask-app. If you can answer that question, that would be great. – MeZo Jan 05 '21 at 17:07
  • @MeZo Checkout my answer – mtabbasi Jan 06 '21 at 14:31
  • Thanks for the solution, it works. Is there any possible way that I can run code after reactor.run()? – MeZo Jan 06 '21 at 17:51
  • @mtabbasi Can you answer this question: https://stackoverflow.com/questions/65605769/is-there-a-way-to-run-code-after-reactor-run-in-scrapy? – MeZo Jan 07 '21 at 03:23
  • @MeZo Check it out – mtabbasi Jan 07 '21 at 09:20
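
A rough sketch of the "check at a regular interval" idea from the comments above, using the same CrawlerRunner. The pending_urls list, the 5-second interval, and the url spider argument are illustrative assumptions, not anything from the thread (in the API scenario, a request handler would append new URLs to the queue):

from twisted.internet import reactor, defer, task
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    name = "myspider"

    def __init__(self, url=None, **kwargs):
        super().__init__(**kwargs)
        self.start_urls = [url] if url else []

    def parse(self, response):
        self.logger.info("crawled %s", response.url)

pending_urls = []  # hypothetical queue; an API handler would append new URLs here
configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()

@defer.inlineCallbacks
def poll_and_crawl():
    # Keep the reactor running forever: every 5 seconds, crawl whatever
    # URLs were queued since the last check, one crawl after another.
    while True:
        while pending_urls:
            yield runner.crawl(MySpider, url=pending_urls.pop(0))
        yield task.deferLater(reactor, 5, lambda: None)  # non-blocking sleep

poll_and_crawl()
reactor.run()  # blocks here; crawls are scheduled by the loop above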