
I'm learning about NLP, and as part of this I'm scraping an Amazon book-review page using Scrapy. I've extracted the fields I want and am outputting them to a JSON file. When this file is loaded as a DataFrame, each field is recorded as a single list rather than one entry per row. How can I split these lists so that the DataFrame has one row per item, instead of every item's entries being lumped into separate lists? Code:

import scrapy


class ReviewspiderSpider(scrapy.Spider):
    name = 'reviewspider'
    allowed_domains = ['amazon.co.uk']
    start_urls = ['https://www.amazon.com/Gone-Girl-Gillian-Flynn/product-reviews/0307588378/ref=cm_cr_othr_d_paging_btm_1?ie=UTF8&reviewerType=all_reviews&pageNumber=1']

    def parse(self, response):
        users = response.xpath('//a[contains(@data-hook, "review-author")]/text()').extract()
        titles = response.xpath('//a[contains(@data-hook, "review-title")]/text()').extract()
        dates = response.xpath('//span[contains(@data-hook, "review-date")]/text()').extract()
        found_helpful = response.xpath('//span[contains(@data-hook, "helpful-vote-statement")]/text()').extract()
        rating = response.xpath('//i[contains(@data-hook, "review-star-rating")]/span[contains(@class, "a-icon-alt")]/text()').extract()
        content = response.xpath('//span[contains(@data-hook, "review-body")]/text()').extract()

        yield {
            'users': users,
            'titles': titles,
            'dates': dates,
            'found_helpful': found_helpful,
            'rating': rating,
            'content': content
        }

Sample Output:

users = ['Lauren', 'James'...'John']
dates = ['on September 28, 2017', 'on December 26, 2017'...'on November 17, 2016']
rating = ['5.0 out of 5 stars', '2.0 out of 5 stars'...'5.0 out of 5 stars']

Desired Output:

index 1: [users='Lauren', dates='on September 28, 2017', rating='5.0 out of 5 stars']
index 2: [users='James', dates='on December 26, 2017', rating='2.0 out of 5 stars']
...
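For reference, this is roughly how I'm loading the JSON file into pandas ('reviews.json' is a placeholder for my output file); each cell ends up holding an entire list:

import pandas as pd

# The feed export wraps the single yielded dict in a JSON array,
# so this loads as a one-row DataFrame whose cells are lists.
df = pd.read_json('reviews.json')
print(df['users'][0])  # -> ['Lauren', 'James', ..., 'John']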

I know that the pipeline associated with the spider should probably be edited to achieve this; however, I have limited Python experience and couldn't make sense of the Scrapy documentation. I've also tried the solutions from here and here, but I don't know enough to reconcile those answers with my own code. Any help would be much appreciated.

  • If I understand correctly this isn't really a question specific to Scrapy, but an issue of understanding data structures and how to manipulate them. What you have is a collection of lists with one item in each list corresponding to a single attribute in a single record. You instead want to output a list of records. These two data structures are completely interchangeable. See for example https://stackoverflow.com/a/1663826/982257 – Iguananaut Jul 07 '18 at 20:49
  • The problem is that the lists aren't returned as tuples, so you would need to specify a delimiter. That doesn't work here, however, because the only candidate delimiter is ", which appears many times within several columns (so it can't be used as a delimiter). – Laurie Jul 07 '18 at 21:27
  • 1
    Oh, you shouldn't be trying to parse the default output from scrapy. If you're worried about delimiters then you're going about it the wrong way. Indeed you could implement a pipeline as in [this answer](https://stackoverflow.com/a/47380905/982257) to customize how to output each item. But let's step back a bit: When you write "desired output" do you mean you literally want your output formatted like that? What is the purpose (in general) of your data and how will it be used? – Iguananaut Jul 07 '18 at 21:45
  • 1
    (As an aside, the code you posted is broken. The `def parse` should be indented, and you have duplicate `.extract()` calls.) – Iguananaut Jul 07 '18 at 21:48
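To make the comments above concrete: a dict of per-field lists and a list of per-review dicts hold the same information, and plain zip converts between them. A minimal sketch with made-up data:

# One dict of per-field lists (what the spider currently yields)...
columns = {'users': ['Lauren', 'James'], 'rating': ['5.0 out of 5 stars', '2.0 out of 5 stars']}

# ...and the equivalent list of per-review records.
records = [dict(zip(columns, row)) for row in zip(*columns.values())]

# records == [{'users': 'Lauren', 'rating': '5.0 out of 5 stars'},
#             {'users': 'James', 'rating': '2.0 out of 5 stars'}]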

2 Answers


After re-reading your question I'm pretty sure this is what you want:

def parse(self, response):
    users = response.xpath('//a[contains(@data-hook, "review-author")]/text()').extract()
    titles = response.xpath('//a[contains(@data-hook, "review-title")]/text()').extract()
    dates = response.xpath('//span[contains(@data-hook, "review-date")]/text()').extract()
    found_helpful = response.xpath('//span[contains(@data-hook, "helpful-vote-statement")]/text()').extract()
    rating = response.xpath('//i[contains(@data-hook, "review-star-rating")]/span[contains(@class, "a-icon-alt")]/text()').extract()
    content = response.xpath('//span[contains(@data-hook, "review-body")]/text()').extract()

    # zip pairs up the i-th element of every list, one review per iteration
    for user, title, date, helpful, stars, body in zip(users, titles, dates, found_helpful, rating, content):
        yield {
            'user': user,
            'title': title,
            'date': date,
            'found_helpful': helpful,
            'rating': stars,
            'content': body
        }

or something to that effect. That's what I was trying to hint at in my first comment.
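One caveat: zip stops at the shortest input, and a field like the helpful-vote statement doesn't appear on every review, so lists of unequal length will silently drop or misalign rows. A more robust sketch selects one review container at a time; the div[@data-hook="review"] container selector here is an assumption you'd need to verify against the live page markup:

def parse(self, response):
    # One container per review keeps every field aligned with its review,
    # even when some fields are missing: extract_first() returns None
    # for an absent field instead of shifting the whole row.
    for review in response.xpath('//div[@data-hook="review"]'):  # assumed container selector
        yield {
            'user': review.xpath('.//a[contains(@data-hook, "review-author")]/text()').extract_first(),
            'title': review.xpath('.//a[contains(@data-hook, "review-title")]/text()').extract_first(),
            'date': review.xpath('.//span[contains(@data-hook, "review-date")]/text()').extract_first(),
            'found_helpful': review.xpath('.//span[contains(@data-hook, "helpful-vote-statement")]/text()').extract_first(),
            'rating': review.xpath('.//i[contains(@data-hook, "review-star-rating")]/span[contains(@class, "a-icon-alt")]/text()').extract_first(),
            'content': review.xpath('.//span[contains(@data-hook, "review-body")]/text()').extract_first(),
        }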

  • Thanks for the help. Your code is giving me the error: 'SyntaxError: 'yield' outside function', unfortunately. I've managed to achieve what I wanted, however, by using the .css method instead of .xpath, which I'll post as an answer. – Laurie Jul 07 '18 at 22:41
  • *My* code isn't giving you that error. You haven't input it correctly--indentation matters in Python. – Iguananaut Jul 09 '18 at 08:32

EDIT: I was able to come up with a solution using the .css method instead of .xpath. This is the spider I used to scrape shirt listings from a fashion retailer:

import scrapy
from ..items import ProductItem

class SportsdirectSpider(scrapy.Spider):
    name = 'sportsdirect'
    allowed_domains = ['www.sportsdirect.com']
    start_urls = ['https://www.sportsdirect.com/mens/mens-shirts']

    def parse(self, response):
        # Each .s-productthumbbox node is one product listing, so fields
        # extracted relative to it stay aligned per item.
        products = response.css('.s-productthumbbox')
        for p in products:
            brand = p.css('.productdescriptionbrand::text').extract_first()
            name = p.css('.productdescriptionname::text').extract_first()
            price = p.css('.curprice::text').extract_first()
            item = ProductItem()
            item['brand'] = brand
            item['name'] = name
            item['price'] = price
            yield item

The related items.py script:

import scrapy

class ProductItem(scrapy.Item):
    brand = scrapy.Field()
    name = scrapy.Field()
    price = scrapy.Field()

Creating a JSON-lines file (in the Anaconda prompt):

cd simple_crawler
scrapy crawl sportsdirect --set FEED_URI=products.jl
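Newer Scrapy versions also accept the shorter -o flag, which should be equivalent here (the feed format is inferred from the .jl extension):

scrapy crawl sportsdirect -o products.jl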

The code used to turn the created .jl file into a DataFrame:

import json
import pandas as pd

# Read the JSON-lines file: one JSON object per line.
with open('products.jl', 'r') as f:
    data = [json.loads(line) for line in f if line.strip()]
df2 = pd.DataFrame(data)

Final output:

           brand                     name  price
0  Pierre Cardin  Short Sleeve Shirt Mens  £6.50
1  Pierre Cardin  Short Sleeve Shirt Mens  £7.00
...
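As an aside, pandas can read JSON-lines files directly, which should make the manual loop above unnecessary (assuming pandas 0.19+, where the lines parameter was added):

import pandas as pd

# One JSON object per line -> one DataFrame row.
df2 = pd.read_json('products.jl', lines=True)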
  • I don't see how this solves your question since it's not working on Amazon reviews. The use of the `.xpath()` method versus `.css()` doesn't really make a difference here--they are two different, but in many cases equivalent, syntaxes for extracting elements from a page. The issue is that after you extract the elements you want to return those elements in a *data structure* that matches the data structure you want for each item returned (via `yield`) by your `parse()` method. – Iguananaut Jul 09 '18 at 08:43
  • In your original question you yielded a *single* item, a single dictionary in which each value in the dictionary is a list. You instead want to yield multiple dictionaries, one for each item in those lists. That's what my answer does and that's also essentially what this answer does, in both cases by looping. – Iguananaut Jul 09 '18 at 08:45