
I want to scrape review data from a website using Scrapy. The code is given below.

The problem is that each time the spider moves to the next page, the callback runs parse from the top and resets records[]. The list is empty again, and every review previously saved in records[] is lost. As a result, when I open my CSV file I only get the reviews of the last page.

What I want is for all the data to be stored in my CSV file, so that records[] does not keep resetting each time the next page is requested. I can't put the line records = [] before the parse method, because then the list is not defined inside it.

Here is my code:

def parse(self, response):
    records = []

    for r in response.xpath('//div[contains(@class, "a-section review")]'):
        rtext = r.xpath('.//div[contains(@class, "a-row review-data")]').extract_first()                
        rating = r.xpath('.//span[contains(@class, "a-icon-alt")]/text()').extract_first()
        votes = r.xpath('normalize-space(.//span[contains(@class, "review-votes")]/text())').extract_first()

        if not votes:
            votes = "none"

        records.append((rating, votes, rtext))
        print(records)

    nextPage = response.xpath('//li[contains(@class, "a-last")]/a/@href').extract_first()
    if nextPage:
        nextPage = response.urljoin(nextPage)
        yield scrapy.Request(url = nextPage)    

    import pandas as pd
    df = pd.DataFrame(records, columns=['rating' , 'votes', 'rtext'])
    df.to_csv('ama.csv', sep = '|', index =False, encoding='utf-8')
vezunchik
scrapitnow
2 Answers


Moving the records declaration into the method signature relies on a common Python gotcha, outlined here in the Python docs. In this instance, however, the usually unwanted behavior of instantiating a list in a method declaration works in your favor.

Python’s default arguments are evaluated once when the function is defined, not each time the function is called (like it is in say, Ruby). This means that if you use a mutable default argument and mutate it, you will and have mutated that object for all future calls to the function as well.
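
A minimal standalone sketch of that behavior (hypothetical names, nothing Scrapy-specific):

def append_to(item, bucket=[]):  # bucket is created once, at definition time
    bucket.append(item)
    return bucket

print(append_to(1))  # [1]
print(append_to(2))  # [1, 2] -- the same list object persists between calls

Applied to your spider, that persistence is exactly what keeps records alive across pages: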

def parse(self, response, records=[]):  # records is created once, when the
                                        # function is defined, and the same list
                                        # is reused on every subsequent call

    for r in response.xpath('//div[contains(@class, "a-section review")]'):
        rtext = r.xpath('.//div[contains(@class, "a-row review-data")]').extract_first()                
        rating = r.xpath('.//span[contains(@class, "a-icon-alt")]/text()').extract_first()
        votes = r.xpath('normalize-space(.//span[contains(@class, "review-votes")]/text())').extract_first()

        if not votes:
            votes = "none"

        records.append((rating, votes, rtext))
        print(records)

    nextPage = response.xpath('//li[contains(@class, "a-last")]/a/@href').extract_first()
    if nextPage:
        nextPage = response.urljoin(nextPage)
        yield scrapy.Request(url = nextPage)    

    # The CSV is rewritten after each page; because records now persists,
    # the final write contains every page's reviews.
    import pandas as pd
    df = pd.DataFrame(records, columns=['rating', 'votes', 'rtext'])
    df.to_csv('ama.csv', sep='|', index=False, encoding='utf-8')

The above method is a little unusual. A more general solution is to simply use a global variable; here is a post going over how to use globals, and a sketch of that approach follows below.
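
A sketch of the global-variable approach, adapted from the code above (the spider class name and its name attribute are assumptions, not from the original):

import scrapy
import pandas as pd

records = []  # module-level, so it survives across parse() calls

class ReviewSpider(scrapy.Spider):  # hypothetical spider class
    name = 'reviews'

    def parse(self, response):
        for r in response.xpath('//div[contains(@class, "a-section review")]'):
            rating = r.xpath('.//span[contains(@class, "a-icon-alt")]/text()').extract_first()
            votes = r.xpath('normalize-space(.//span[contains(@class, "review-votes")]/text())').extract_first() or "none"
            rtext = r.xpath('.//div[contains(@class, "a-row review-data")]').extract_first()
            records.append((rating, votes, rtext))  # mutating the list needs no 'global' statement

        nextPage = response.xpath('//li[contains(@class, "a-last")]/a/@href').extract_first()
        if nextPage:
            yield scrapy.Request(url=response.urljoin(nextPage))

        # Rewrite the CSV after each page; the final write holds all pages.
        df = pd.DataFrame(records, columns=['rating', 'votes', 'rtext'])
        df.to_csv('ama.csv', sep='|', index=False, encoding='utf-8')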

PeterH

Here parse is a callback that is invoked again for every page. Try defining records globally, or write an appender function and call it to accumulate values.

Also, Scrapy can generate the CSV itself. Here's my little experiment with scraping: https://gist.github.com/lisitsky/c4aac52edcb7abfd5975be067face1bb

So you can let Scrapy write the data to CSV, and pandas can read the file afterwards.
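
For instance, a minimal sketch using Scrapy's built-in feed exports (the spider class, its name, and the start URL are placeholders, not from the question):

import scrapy

class ReviewSpider(scrapy.Spider):  # hypothetical spider class
    name = 'reviews'
    start_urls = ['https://example.com/reviews']  # placeholder URL

    def parse(self, response):
        # Yield one item per review; Scrapy collects items from every page.
        for r in response.xpath('//div[contains(@class, "a-section review")]'):
            yield {
                'rating': r.xpath('.//span[contains(@class, "a-icon-alt")]/text()').extract_first(),
                'votes': r.xpath('normalize-space(.//span[contains(@class, "review-votes")]/text())').extract_first() or "none",
                'rtext': r.xpath('.//div[contains(@class, "a-row review-data")]').extract_first(),
            }
        nextPage = response.xpath('//li[contains(@class, "a-last")]/a/@href').extract_first()
        if nextPage:
            yield response.follow(nextPage)  # no callback given, so parse is reused

Running it with scrapy crawl reviews -o reviews.csv writes every yielded item, from every page, into a single CSV file.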

Eugene Lisitsky