I have a Scrapy + Splash spider that crawls data. Now I want to run the spider from a script, so I use CrawlerProcess. My file looks like this:
import scrapy
from scrapy_splash import SplashRequest
from scrapy.crawler import CrawlerProcess


class ProvinceSpider(scrapy.Spider):
    name = 'province'

    def start_requests(self):
        url = "https://e.vnexpress.net/covid-19/vaccine"
        yield SplashRequest(url=url, callback=self.parse)

    def parse(self, response):
        provinces = response.xpath("//div[@id='total_vaccine_province']/ul[@data-weight]")
        for province in provinces:
            yield {
                'province_name': province.xpath(".//li[1]/text()").get(),
                'province_population': province.xpath(".//li[2]/text()").get(),
                'province_expected_distribution': province.xpath(".//li[3]/text()").get(),
                'province_actual_distribution': province.xpath(".//li[4]/text()").get(),
                'province_distribution_percentage': province.xpath(".//li[5]/div/div/span/text()").get(),
            }


process = CrawlerProcess(settings={
    "FEEDS": {
        "province.json": {"format": "json"},
    },
})
process.crawl(ProvinceSpider)
process.start()  # the script will block here until the crawling is finished
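For reference, scrapy-splash is normally wired up through the project settings; its README configures settings.py along these lines (the SPLASH_URL shown assumes a Splash instance running locally on port 8050, e.g. via Docker):

```python
# Standard scrapy-splash configuration, per the library's README.
# SPLASH_URL assumes Splash is running locally, e.g. started with:
#   docker run -p 8050:8050 scrapinghub/splash
SPLASH_URL = 'http://localhost:8050'

# Middlewares that route SplashRequest objects through the Splash server.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Dupe filter that understands Splash request fingerprints.
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```

Note that these live in the project's settings.py, while my script above only passes the FEEDS setting to CrawlerProcess.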
But when I run
python3 province.py
it doesn't connect to the Splash server and therefore can't crawl any data. Any idea which part I'm doing wrong? Thanks in advance.