I've been trying to build a small scraper for eBay (college assignment). I've figured out most of it, but I've run into an issue with my loop.
from scrapy.contrib.spiders import CrawlSpider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from loop.items import loopitems

class myProjectSpider(CrawlSpider):
    name = 'looper'
    allowed_domains = ['ebay.com']
    start_urls = [l.strip() for l in open('bobo.txt').readlines()]

    def __init__(self):
        super(myProjectSpider, self).__init__()
        service_args = ['--load-images=no']
        self.driver = webdriver.PhantomJS(
            executable_path='/Users/localhost/desktop/.bin/phantomjs.cmd',
            service_args=service_args)

    def parse(self, response):
        self.driver.get(response.url)
        item = loopitems()
        for abc in range(2, 50):
            abc = str(abc)
            # Probe for the <option> first, so a missing node doesn't blow up.
            jackson = self.driver.execute_script(
                "return !!document.evaluate('.//div[5]/div[2]/select/option[" + abc + "]', "
                "document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue;")
            if jackson:
                item['title'] = self.driver.execute_script(
                    "return document.evaluate('.//div[5]/div[2]/select/option[" + abc + "]', "
                    "document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue.textContent;")
                yield item
            else:
                break
The URLs (start_urls are read from the txt file):
http://www.ebay.com/itm/Mens-Jeans-Slim-Fit-Straight-Skinny-Fit-Denim-Trousers-Casual-Pants-14-color-/221560999664?pt=LH_DefaultDomain_0&var=&hash=item3396108ef0
http://www.ebay.com/itm/New-Apple-iPad-3rd-Generation-16GB-32GB-or-64GB-WiFi-Retina-Display-Tablet-/261749018535?pt=LH_DefaultDomain_0&var=&hash=item3cf1750fa7
I'm running Scrapy 0.24.6 and PhantomJS 2.0. The objective is to visit each URL and extract the variations (attributes) from the eBay listing form. The if statement at the top of the loop checks whether the element exists, because Selenium returns a bad-header error if it can't find an element. I also yield inside the loop because I need each variation on its own row. I use execute_script because it is far faster (roughly 100x in my tests) than Selenium's find_element_by_xpath.
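For reference, the per-index JS probe is just a string built around the option number. A small hypothetical helper (pure string work, no browser involved; the helper name is my own, not part of the spider) shows what gets sent to execute_script:

```python
# Hypothetical helper: builds the existence-probe JS for a given <option> index.
JS_EXISTS = (
    "return !!document.evaluate('.//div[5]/div[2]/select/option[{idx}]', "
    "document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue;"
)

def exists_script(idx):
    # idx is an integer option index, e.g. 2..49 as in the loop above
    return JS_EXISTS.format(idx=idx)

print(exists_script(2))
```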
The main problem is how Scrapy returns my item results. With a single start URL it works as it should: all items come back in a neat order. The moment I add more URLs, I get a completely different result: my items are scrambled, some are returned multiple times, and the output varies almost every run. After countless tests I noticed that yield item is causing some kind of problem; when I removed it and simply printed the results instead, they came out perfectly. I really need each item on its own row, though, and the only way I've managed that is with yield item (maybe there's a better way?).
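To narrow this down, here is a minimal standalone sketch (plain dicts, no Scrapy or Selenium; function names are made up for the demo) of what I suspect is happening: the spider yields the same item object on every iteration, so anything that collects the yielded items before reading them sees only the last value, repeated.

```python
def parse_variations(options):
    item = {}                      # one item reused across iterations,
    for opt in options:            # like item = loopitems() outside the loop
        item['title'] = opt
        yield item                 # yields the *same* dict every time

# Collecting first (as a pipeline/exporter might) before reading:
collected = list(parse_variations(['Small', 'Medium', 'Large']))
print([it['title'] for it in collected])   # ['Large', 'Large', 'Large']

def parse_variations_fixed(options):
    for opt in options:
        yield {'title': opt}       # fresh item per variation

collected = list(parse_variations_fixed(['Small', 'Medium', 'Large']))
print([it['title'] for it in collected])   # ['Small', 'Medium', 'Large']
```

If that is the cause, creating the item inside the loop (one fresh object per yield) would be the minimal change.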
For now I have just copy-pasted the loop body and changed the XPath option index manually. That works as expected, but I really need to be able to loop through the items in the future. If anyone sees an error in my code, or a better way to attempt this, please tell me. All responses are helpful...
Thanks