
I'm trying to scrape the website of a prominent UK retailer using both Selenium and Scrapy (see code below). I'm getting a `[scrapy.core.scraper] ERROR: Spider error processing` and have no idea what else to try (I've been at it for about three hours). Thank you for all your support!

import scrapy
from selenium import webdriver
from nl_scrape.items import NlScrapeItem
from datetime import date
import time

class ProductSpider(scrapy.Spider):
    name = "product_spider"
    allowed_domains = ['newlook.com']
    start_urls = ['http://www.newlook.com/uk/womens/clothing/c/uk-womens-clothing?comp=NavigationBar%7Cmn%7Cwomens%7Cclothing#/?q=:relevance&page=1&sort=relevance&content=false']

    def __init__(self):
        self.driver = webdriver.Safari()
        self.driver.set_window_size(800, 600)
        time.sleep(4)

    def parse(self, response):
        self.driver.get(response.url)
        time.sleep(4)

        # Collect products
        products = driver.find_elements_by_class_name('plp-item ng-scope')

        # Iterate over products; extract data and append individual features to NlScrapeItem
        for item in products:

            # Pull features
            desc = item.find_element_by_class_name('product-item__name link--nounderline ng-binding').text
            href = item.find_element_by_class_name('plp-carousel__img-link ng-scope').get_attribute('href')

            # Price symbol removal and float conversion
            priceString = item.find_element_by_class_name('price ng-binding').text
            priceInt = priceString.split('£')[1]
            price = float(priceInt)

            # Generate a product identifier
            identifier = href.split('/p/')[1].split('?comp')[0]
            identifier = int(identifier)

            # datetime
            dt = date.today()
            dt = dt.isoformat()

            # NlScrapeItem
            item = NlScrapeItem()

            # Append product to NlScrapeItem
            item['id'] = identifier
            item['href'] = href
            item['description'] = desc
            item['price'] = price
            item['firstSighted'] = dt
            item['lastSighted'] = dt
            yield item

        self.driver.close()

2017-08-26 15:48:38 [scrapy.core.scraper] ERROR: Spider error processing <GET http://www.newlook.com/uk/womens/clothing/c/uk-womens-clothing?comp=NavigationBar%7Cmn%7Cwomens%7Cclothing#/?q=:relevance&page=1&sort=relevance&content=false> (referer: None)

Traceback (most recent call last):
  File "/Users/username/Documents/nl_scraping/nl_env/lib/python3.6/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Users/username/Documents/nl_scraping/nl_scrape/nl_scrape/spiders/product_spider.py", line 18, in parse
    products = driver.find_elements_by_class_name('plp-item ng-scope')
NameError: name 'driver' is not defined

Philipp
  • try using `products = self.driver.find_elements_by_class_name('plp-item ng-scope')` and let's see if it works – Kapil Aug 26 '17 at 15:45
  • @Kapil: no luck unfortunately :( _ERROR: Spider error processing_ prevails – Philipp Aug 26 '17 at 16:01
  • did your Safari browser at least start? – Kapil Aug 26 '17 at 16:11
  • @Kapil: Yes it does - I previously played around with selenium in the console just to get a sense of it, and every line of code works individually - it's just the Scrapy integration that's struggling. Do you have any thoughts on using a middleware? [link](https://stackoverflow.com/questions/31174330/passing-selenium-response-url-to-scrapy) - I was thinking of the answer to that question (sketched right after these comments). – Philipp Aug 26 '17 at 16:14
  • @Kapil: Thank you very much for all your help! works now thanks to the workaround below - your thinking was obviously extremely close given Tarun's answer is working! – Philipp Aug 26 '17 at 16:33
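
For the middleware route mentioned in the comments: a minimal sketch of that pattern, modelled on the linked answer. A downloader middleware fetches each request with Selenium and hands Scrapy the rendered page, so `parse` receives the JavaScript-rendered HTML. The class name, module path, and choice of Safari driver are illustrative assumptions, not code from this thread.

import time

from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumDownloaderMiddleware(object):

    def __init__(self):
        # Illustrative: one shared browser instance for all requests
        self.driver = webdriver.Safari()

    def process_request(self, request, spider):
        self.driver.get(request.url)
        time.sleep(4)  # crude wait for the Angular content to render
        # Returning a Response object here short-circuits Scrapy's own
        # downloader, so the spider parses the Selenium-rendered source.
        return HtmlResponse(self.driver.current_url,
                            body=self.driver.page_source,
                            encoding='utf-8',
                            request=request)

It would be enabled in settings.py, e.g. `DOWNLOADER_MIDDLEWARES = {'nl_scrape.middlewares.SeleniumDownloaderMiddleware': 543}` (the path assumes the project layout from the question).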

1 Answer


So your code has two issues

def parse(self, response):
    self.driver.get(response.url)
    time.sleep(4)

    # Collect products
    products = driver.find_elements_by_class_name('plp-item ng-scope')

Very conveniently, you changed self.driver to just driver. It doesn't work that way: driver is never defined inside parse, which is exactly what the NameError says. You should add this at the top of the function:

def parse(self, response):
    driver = self.driver
    driver.get(response.url)
    time.sleep(4)

    # Collect products
    products = driver.find_elements_by_class_name('plp-item ng-scope')

Next, you have self.driver.close() at the end of the function, so you will close the browser after processing just one URL. That is wrong, so remove that line.
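
If you still want the browser to shut down once the whole crawl has finished, one option (a sketch, not something this answer prescribes) is Scrapy's `closed()` shortcut, which the engine calls when the spider closes:

class ProductSpider(scrapy.Spider):
    # ... name, start_urls, __init__ and parse as before ...

    def closed(self, reason):
        # Called exactly once, when the spider finishes, so the
        # browser survives across every parsed page.
        self.driver.quit()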

Tarun Lalwani
  • that was it - thank you very much - I can't up vote because my reputation is still too low! – Philipp Aug 26 '17 at 16:31
  • @Tarun Lalwani, although this is not my thread, I would be very happy if I could know where to put that line `self.driver.close()`. Thanks. – SIM Sep 23 '17 at 10:47
  • @Shahin, you would need to listen for the `spider_closed` signal and execute the code there (a runnable sketch follows at the end of this thread). There is a simple example of how to hook into that signal here: https://doc.scrapy.org/en/latest/topics/signals.html – Tarun Lalwani Sep 23 '17 at 10:49
  • Thanks a lot. You should have +1 from my end. – SIM Sep 23 '17 at 11:07
  • @Tarun Lalwani, looks like I found a solution for closing the webdriver when it is done. Just putting `def __del__(self): self.driver.close()` after the `__init__` method will do the trick. – SIM Sep 29 '17 at 09:29
  • @Shahin, thanks for sharing this. Will keep that solution in mind – Tarun Lalwani Sep 29 '17 at 09:32
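
For reference, a minimal sketch of the signal approach described in the comments above, modelled on the example in the Scrapy signals documentation. It assumes the spider stores its webdriver on `self.driver`, as in the question:

import scrapy
from scrapy import signals

class ProductSpider(scrapy.Spider):
    name = "product_spider"

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Run spider_closed() when the spider_closed signal fires.
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        # Fires once, after the crawl has finished.
        self.driver.quit()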