
I need to scrape all the reviews from a product on Amazon:

https://www.amazon.com/Cascade-ActionPacs-Dishwasher-Detergent-Packaging/dp/B01NGTV4J5/ref=pd_rhf_cr_s_trq_bnd_0_6/130-6831149-4603948?_encoding=UTF8&pd_rd_i=B01NGTV4J5&pd_rd_r=b6f87690-19d7-4dba-85c0-b8f54076705a&pd_rd_w=AgonG&pd_rd_wg=GG9yY&pf_rd_p=4e0a494a-50c5-45f5-846a-abfb3d21ab34&pf_rd_r=QAD0984X543RFMNNPNF2&psc=1&refRID=QAD0984X543RFMNNPNF2

I am using Scrapy to do this. However, the following code does not scrape all the reviews, as they are split across different pages. A human would click on "See all reviews" first, then click through the next pages. I am wondering how I could do this using Scrapy or a different tool in Python. There are 5,893 reviews for this product and I cannot get this information manually.

Currently my code is the following:

import scrapy
from scrapy.crawler import CrawlerProcess

class My_Spider(scrapy.Spider):
    name = 'spid'
    start_urls = ['https://www.amazon.com/Cascade-ActionPacs-Dishwasher-Detergent-Packaging/dp/B01NGTV4J5/ref=pd_rhf_cr_s_trq_bnd_0_6/130-6831149-4603948?_encoding=UTF8&pd_rd_i=B01NGTV4J5&pd_rd_r=b6f87690-19d7-4dba-85c0-b8f54076705a&pd_rd_w=AgonG&pd_rd_wg=GG9yY&pf_rd_p=4e0a494a-50c5-45f5-846a-abfb3d21ab34&pf_rd_r=QAD0984X543RFMNNPNF2&psc=1&refRID=QAD0984X543RFMNNPNF2']

    def parse(self, response):
        for row in response.css('div.review'):
            item = {}

            item['author'] = row.css('span.a-profile-name::text').extract_first()

            rating = row.css('i.review-rating > span::text').extract_first().strip().split(' ')[0]
            item['rating'] = int(float(rating.strip().replace(',', '.')))

            item['title'] = row.css('span.review-title > span::text').extract_first()
            yield item

And to execute the crawler:

process = CrawlerProcess({
})

process.crawl(My_Spider)
process.start() 

Can you tell me if it is possible to move to the next pages and scrape all the reviews? This should be the page where the reviews are stored.

1 Answer


With the URL https://www.amazon.com/Cascade-ActionPacs-Dishwasher-Detergent-Packaging/product-reviews/B01NGTV4J5/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=<PUT PAGE NUMBER HERE> you could do something like this:

import scrapy
from scrapy.crawler import CrawlerProcess

class My_Spider(scrapy.Spider):
    name = 'spid'
    start_urls = ['https://www.amazon.com/Cascade-ActionPacs-Dishwasher-Detergent-Packaging/product-reviews/B01NGTV4J5/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=1']

    def parse(self, response):
        for row in response.css('div.review'):
            item = {}
            item['author'] = row.css('span.a-profile-name::text').extract_first()
            rating = row.css('i.review-rating > span::text').extract_first().strip().split(' ')[0]
            item['rating'] = int(float(rating.strip().replace(',', '.')))
            item['title'] = row.css('span.review-title > span::text').extract_first()
            yield item
        # Follow the "Next page" link; it is a relative URL, so use
        # response.follow, and stop when the link disappears on the last page.
        next_page = response.css('ul.a-pagination > li.a-last > a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
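If following the "next" link ever proves brittle, an alternative sketch using the same pageNumber template: since Amazon shows 10 reviews per page, you could generate every page URL up front (the helper name here is my own, not part of Scrapy):

```python
# Sketch: generate all review-page URLs from the pageNumber template.
# Assumes 10 reviews per page and the 5893-review total from the question.
BASE = ('https://www.amazon.com/Cascade-ActionPacs-Dishwasher-Detergent-Packaging/'
        'product-reviews/B01NGTV4J5/ref=cm_cr_arp_d_paging_btm_next_2'
        '?ie=UTF8&reviewerType=all_reviews&pageNumber={}')

def review_page_urls(total_reviews, per_page=10):
    pages = -(-total_reviews // per_page)  # ceiling division
    return [BASE.format(n) for n in range(1, pages + 1)]

# These could replace start_urls in the spider above.
urls = review_page_urls(5893)
```

This way the spider no longer depends on the pagination widget rendering correctly on every page.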
Felix Eklöf
  • First of all, thank you for your answer. Unfortunately I am getting this error: `ReactorNotRestartable: ` with no further information about the error. This is caused by `---> 33 process.start()`, then `--> 293 reactor.run(installSignalHandlers=False) # blocking call` –  Aug 03 '20 at 14:45
  • Hm, did you replace the `item = ReviewItem()` and the `...` part with your item and code? – Felix Eklöf Aug 03 '20 at 14:48
  • I replaced the `item = ReviewItem()` and the `...` part with: ` item = {} item['author'] = row.css('span.a-profile-name::text').extract_first() rating = row.css('i.review-rating > span::text').extract_first().strip().split(' ')[0] item['rating'] = int(float(rating.strip().replace(',', '.'))) item['title'] = row.css('span.review-title > span::text').extract_first()` –  Aug 03 '20 at 14:50
  • The problem is here: `process = CrawlerProcess({ }) process.crawl(My_Spider) process.start()`. when I try to execute the process and store data –  Aug 03 '20 at 14:51
  • Have a look at this and see if any of that solves it: https://stackoverflow.com/questions/41495052/scrapy-reactor-not-restartable Also, you can try to change the code back. Does it work then? – Felix Eklöf Aug 03 '20 at 14:53
  • I have tried the solution proposed there, but I am still getting the same error, here `----> 6 run_spider(My_Spider)` –  Aug 03 '20 at 15:29