My Scrapy code goes like this:
import scrapy
from scrapy.crawler import CrawlerProcess

class MovieSpider(scrapy.Spider):
    name = "movies"
    start_urls = [
        'https://movie.douban.com/subject/25934014/',
        'https://movie.douban.com/subject/25852314/',
    ]

    def parse(self, response):
        title = response.css('div#wrapper div#content h1 span::text').extract_first()
        year = response.css('div#wrapper div#content h1 span.year::text').extract_first()
        yield {
            'url': response.url,
            'title': title,
            'year': year,
        }
And I run it like this:
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'FEED_FORMAT': 'json',
    'FEED_URI': 'movie.json',
    'FEED_EXPORT_ENCODING': 'utf-8',
})
process.crawl(MovieSpider)
process.start()  # the script will block here until the crawling is finished
which is the recommended way in the docs.
The problem is that after I run the script once, I can't run it again: the Jupyter notebook raises a ReactorNotRestartable error.
If I restart the kernel in Jupyter, the script runs fine the first time.
I think the problem is the one described in Scrapy crawl from script always blocks script execution after scraping,
and it might be possible to solve it by using their code. However, their code is quite complex for such a small task, and far from the CrawlerProcess
approach recommended in the docs.
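For reference, the pattern in that answer is roughly the CrawlerRunner one from the docs, which (as I understand it) looks something like the following sketch, driving the Twisted reactor by hand (the settings dict is just carried over from my script above):

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner({
    'FEED_FORMAT': 'json',
    'FEED_URI': 'movie.json',
})
d = runner.crawl(MovieSpider)
d.addBoth(lambda _: reactor.stop())  # stop the reactor once the crawl finishes
reactor.run()  # blocks here until reactor.stop() is called

This still ends up calling reactor.run(), so I don't see how it avoids the restart problem in a notebook.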
I'm wondering if there is a better way to solve this problem?
I tried adding process.stop() at the end of the script. It didn't help.
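For completeness, this is what that attempt looked like, with process.stop() placed right after process.start():

process.crawl(MovieSpider)
process.start()
process.stop()  # didn't help: the next run still raises ReactorNotRestartable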