
I have a Scrapy + Splash spider to crawl data. Now I want to run my Scrapy spider from a script, so I use CrawlerProcess. My file looks like this:

import scrapy
from scrapy_splash import SplashRequest
from scrapy.crawler import CrawlerProcess
class ProvinceSpider(scrapy.Spider):
    name = 'province'

    def start_requests(self):
        url = "https://e.vnexpress.net/covid-19/vaccine"

        yield SplashRequest(url=url, callback=self.parse)

    def parse(self, response):
        provinces = response.xpath("//div[@id='total_vaccine_province']/ul[@data-weight]")
        for province in provinces:
            yield {
                'province_name': province.xpath(".//li[1]/text()").get(),
                'province_population': province.xpath(".//li[2]/text()").get(),
                'province_expected_distribution': province.xpath(".//li[3]/text()").get(),
                'province_actual_distribution': province.xpath(".//li[4]/text()").get(),
                'province_distribution_percentage': province.xpath(".//li[5]/div/div/span/text()").get(),
            }

process = CrawlerProcess(settings={
    "FEEDS": {
        "province.json": {"format": "json"},
    },
})

process.crawl(ProvinceSpider)
process.start() # the script will block here until the crawling is finished

But when I run

python3 province.py

it doesn't connect to the Splash server and thus can't crawl any data. Any idea which part I got wrong? Thanks in advance.

1 Answer


It turns out the issue you're experiencing is covered by the following existing answer: Answer

A quick break-down (if you're not interested in the details):

Go to settings.py and add a USER_AGENT; in my case I set it to:

USER_AGENT = 'testit (http://www.yourdomain.com)'

Then run your crawler and it should work. Why? Because your Scrapy requests are being blocked by the site.

Output:

2021-12-26 13:15:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://e.vnexpress.net/covid-19/vaccine>
{'province_name': 'HCMC', 'province_population': '7.2M', 'province_expected_distribution': '13.8M', 'province_actual_distribution': '14.6M', 'province_distribution_percentage': '100%'}
2021-12-26 13:15:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://e.vnexpress.net/covid-19/vaccine>
{'province_name': 'Hanoi', 'province_population': '6.2M', 'province_expected_distribution': '11.4M', 'province_actual_distribution': '12.3M', 'province_distribution_percentage': '99,2%'}
2021-12-26 13:15:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://e.vnexpress.net/covid-19/vaccine>
{'province_name': 'Dong Nai', 'province_population': '2.4M', 'province_expected_distribution': '4.3M', 'province_actual_distribution': '5M', 'province_distribution_percentage': '100%'}
...
...

Here are my custom settings:

BOT_NAME = 'testing'
SPIDER_MODULES = ['testing.spiders']
NEWSPIDER_MODULE = 'testing.spiders'
SPLASH_URL = 'http://localhost:8050'
USER_AGENT = 'testing (http://www.yourdomain.com)'
ROBOTSTXT_OBEY = False
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.2 Safari/605.1.15'
}
SPIDER_MIDDLEWARES = {
    'testing.middlewares.TestingSpiderMiddleware': 100,
}
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

joe_bill.dollar
  • Do you run the scrapy file with python3 province.py? I added USER_AGENT to my settings.py and ran it, but don't get any results – Đỗ Quang Huy Dec 26 '21 at 15:09
  • @ĐỗQuangHuy I did `scrapy crawl province` in the terminal. However, if it's not working for you it's likely an issue with `splash`. You may have not set it up correctly. – joe_bill.dollar Dec 26 '21 at 15:20
  • Could you try running it with the python3 command and show me your settings, please? – Đỗ Quang Huy Dec 26 '21 at 15:36
  • I can run scrapy crawl province in the terminal successfully, but not with the python3 script – Đỗ Quang Huy Dec 26 '21 at 16:09
  • Could you take a look at my problem in more detail here: https://stackoverflow.com/questions/70477927/using-scrapy-script-in-flatpak-project Basically I want to run the scrapy file with "scrapy crawl" but hit a problem, so now I want to run it with "python3 ..." – Đỗ Quang Huy Dec 26 '21 at 16:12
  • @ĐỗQuangHuy I managed to get it to work with the script on my console - I use Visual Studio Code. The trick was to set up a VPN, and replace `SplashRequest` with `scrapy.Request`. Otherwise the website will keep trying to block you. – joe_bill.dollar Dec 26 '21 at 16:14