My Scrapy code goes like this:
import scrapy
from scrapy.crawler import CrawlerProcess

class MovieSpider(scrapy.Spider):
    name = "movies"
    start_urls = [
        'https://movie.douban.com/subject/25934014/',
        'https://movie.douban.com/subject/25852314/',
    ]

    def parse(self, response):
        title = response.css('div#wrapper div#content h1 span::text').extract_first()
        year = response.css('div#wrapper div#content h1 span.year::text').extract_first()
        yield {
            'url': response.url,
            'title': title,
            'year': year,
        }
And I run it like this:
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'FEED_FORMAT': 'json',
    'FEED_URI': 'movie.json',
    'FEED_EXPORT_ENCODING': 'utf-8',
})
process.crawl(MovieSpider)
process.start()  # the script will block here until the crawling is finished
which is the recommended way in the docs.
The problem is that after I run the script once, I can't run it again: the Jupyter notebook raises a ReactorNotRestartable error.
If I restart the kernel in Jupyter, the script runs fine the first time.
I think the problem is the one described in Scrapy crawl from script always blocks script execution after scraping,
and it might be possible to solve it by using their code. However, their code is quite complex for such a small task, and far from the CrawlerProcess
approach recommended in the docs.
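For reference, the pattern in that answer is roughly the CrawlerRunner one from the docs, which (as I understand it) looks something like the following sketch, driving the Twisted reactor by hand (the settings dict is just carried over from my script above):

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner({
    'FEED_FORMAT': 'json',
    'FEED_URI': 'movie.json',
})
d = runner.crawl(MovieSpider)
d.addBoth(lambda _: reactor.stop())  # stop the reactor once the crawl finishes
reactor.run()  # blocks here until reactor.stop() is called

This still ends up calling reactor.run(), so I don't see how it avoids the restart problem in a notebook.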
I'm wondering if there is a better way to solve this problem?
I tried adding process.stop() at the end of the script. It didn't help.
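For completeness, this is what that attempt looked like, with process.stop() placed right after process.start():

process.crawl(MovieSpider)
process.start()
process.stop()  # didn't help: the next run still raises ReactorNotRestartable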