I want to scrape review data from a website using Scrapy; my code is given below.
The problem is that each time the spider moves to the next page, parse starts again from the top (because it is the callback) and records is reset to []. Every review that was saved in records is lost, so when I open my csv file I only get the reviews of the last page.
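If I understand it correctly, because I pass no callback, my request below is equivalent to this (parse is Scrapy's default callback):

    yield scrapy.Request(url=nextPage, callback=self.parse)

so every page re-enters parse and hits records = [] again.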
What I want is for all the data to be stored in my csv file, i.e. for records to stop resetting each time the next page is requested. I can't put the line records = [] before the parse method, because then the array is not defined inside it.
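For reference, this is roughly what I mean by putting it before the parse method (a sketch; the spider name is made up, my real spider differs):

    class ReviewSpider(scrapy.Spider):  # hypothetical name
        name = 'reviews'
        records = []  # defined before parse ...

        def parse(self, response):
            records.append(...)  # ... but here I get NameError: name 'records' is not defined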
Here is my code:
# parse is a method of my spider class (a scrapy.Spider subclass); import scrapy is at the top of the file
def parse(self, response):
    records = []  # re-created on every call, i.e. on every page
    for r in response.xpath('//div[contains(@class, "a-section review")]'):
        rtext = r.xpath('.//div[contains(@class, "a-row review-data")]').extract_first()
        rating = r.xpath('.//span[contains(@class, "a-icon-alt")]/text()').extract_first()
        votes = r.xpath('normalize-space(.//span[contains(@class, "review-votes")]/text())').extract_first()
        if not votes:
            votes = "none"
        records.append((rating, votes, rtext))
    print(records)
    nextPage = response.xpath('//li[contains(@class, "a-last")]/a/@href').extract_first()
    if nextPage:
        nextPage = response.urljoin(nextPage)
        yield scrapy.Request(url=nextPage)  # no callback given, so parse is called again
    import pandas as pd
    df = pd.DataFrame(records, columns=['rating', 'votes', 'rtext'])
    df.to_csv('ama.csv', sep='|', index=False, encoding='utf-8')  # overwritten on every page
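One idea I had, though I'm not sure it is the right Scrapy way, is to keep records on the spider itself and only write the csv once at the end, roughly like this (a sketch; ReviewSpider and the omitted extraction details stand in for my real code):

    import scrapy
    import pandas as pd

    class ReviewSpider(scrapy.Spider):  # hypothetical name
        name = 'reviews'

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.records = []  # one list for the whole crawl, not per callback

        def parse(self, response):
            # same extraction as above, but appending to self.records
            for r in response.xpath('//div[contains(@class, "a-section review")]'):
                ...

        def closed(self, reason):
            # Scrapy calls closed() once when the spider finishes
            df = pd.DataFrame(self.records, columns=['rating', 'votes', 'rtext'])
            df.to_csv('ama.csv', sep='|', index=False, encoding='utf-8')

Would something like this work, or is there a more idiomatic way (items / feed exports)?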