
I've been trying to build a small scraper for eBay (college assignment). I've already figured out most of it, but I ran into an issue with my loop.

from scrapy.contrib.spiders import CrawlSpider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from loop.items import loopitems

class myProjectSpider(CrawlSpider):
    name = 'looper'
    allowed_domains = ['ebay.com']
    start_urls = [l.strip() for l in open('bobo.txt').readlines()]

    def __init__(self):
        # PhantomJS without image loading, to keep page loads fast
        service_args = ['--load-images=no', ]
        self.driver = webdriver.PhantomJS(executable_path='/Users/localhost/desktop/.bin/phantomjs.cmd', service_args=service_args)

    def parse(self, response):
        self.driver.get(response.url)
        item = loopitems()
        for abc in range(2, 50):
            abc = str(abc)
            # check whether option[abc] exists before trying to read it
            jackson = self.driver.execute_script("return !!document.evaluate('.//div[5]/div[2]/select/option[" + abc + "]', document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue;")
            if jackson:
                item['title'] = self.driver.execute_script("return document.evaluate('.//div[5]/div[2]/select/option[" + abc + "]', document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue.textContent;")
                yield item
            else:
                break

The URLs (start_urls are loaded from the txt file):

http://www.ebay.com/itm/Mens-Jeans-Slim-Fit-Straight-Skinny-Fit-Denim-Trousers-Casual-Pants-14-color-/221560999664?pt=LH_DefaultDomain_0&var=&hash=item3396108ef0
http://www.ebay.com/itm/New-Apple-iPad-3rd-Generation-16GB-32GB-or-64GB-WiFi-Retina-Display-Tablet-/261749018535?pt=LH_DefaultDomain_0&var=&hash=item3cf1750fa7

I'm running Scrapy 0.24.6 and PhantomJS 2.0. The objective is to visit the URLs and extract the variations or attributes from the eBay form. The if statement at the start of the loop checks whether the element exists, because Selenium returns a bad header error if it can't find one. I yield item inside the loop because I need each variation on a new row. I use execute_script because, in my testing, it is about 100 times faster than Selenium's get-element-by-XPath.
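(One possible alternative, sketched below and untested against eBay's markup: Selenium's find_elements returns an empty list instead of raising when nothing matches, so all the option nodes can be collected in one call and a fresh item yielded per variation.)

options = self.driver.find_elements(By.XPATH, './/div[5]/div[2]/select/option')
for option in options[1:]:     # skip option[1], as range(2, 50) does above
    item = loopitems()         # new item instance per variation, one row each
    item['title'] = option.text
    yield item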

The main problem is the way Scrapy returns my item results. If I use one URL as my start_url it works like it should (it returns all the items in a neat order). The second I add more URLs I get a completely different result: all my items are scrambled, some items are returned multiple times, and it varies almost every time. After countless tests I noticed that yield item was causing some kind of problem; when I removed it and just printed the results instead, they came out perfectly. I really need each item on a new row, though, and the only way I've managed that is with yield item (maybe there's a better way?).

For now I've just copy-pasted the looped code and changed the XPath option manually, and it works as expected, but I really need to be able to loop through the items in the future. If someone sees an error in my code or a better way to do it, please tell me. All responses are helpful...

Thanks

1 Answer


If I've correctly understood what you want to do, I think this could help you:

Scrapy Crawl URLs in Order

The problem is that start_urls are not processed in order. They are passed to the start_requests method, and each downloaded response is then handed to the parse method as it arrives. This is asynchronous.
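For context, the default start_requests in Scrapy 0.24 is roughly the following, which is why every start_url is requested up front and responses reach parse in whatever order they finish downloading:

def start_requests(self):
    # roughly the stock Spider.start_requests: one Request per start_url,
    # all handed to the scheduler immediately and fetched concurrently
    for url in self.start_urls:
        yield self.make_requests_from_url(url)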

Maybe this helps:

# Do your thing
start_urls = [open('bobo.txt').readlines()[0].strip()]
other_urls = [l.strip() for l in open('bobo.txt').readlines()[1:]]
other_urls.reverse()

# Do your thing
def parse(self, response):

    # Do your thing
    if len(self.other_urls) != 0:
        url = self.other_urls.pop()
        yield Request(url=url, callback=self.parse)
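
Putting it together with the spider from the question, it would look roughly like this (a sketch; bobo.txt, loopitems and the parse body are the asker's, and reversing other_urls lets pop() return the URLs in file order):

class myProjectSpider(CrawlSpider):
    name = 'looper'
    allowed_domains = ['ebay.com']

    _urls = [l.strip() for l in open('bobo.txt') if l.strip()]
    start_urls = _urls[:1]                  # only the first URL is crawled up front
    other_urls = list(reversed(_urls[1:]))  # the rest are chained one by one

    def parse(self, response):
        # ... extract the option texts and yield items exactly as in the question ...
        if self.other_urls:
            yield Request(url=self.other_urls.pop(), callback=self.parse)

Because only one request is outstanding at a time, the responses come back in exactly the order the URLs appear in the file.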
Bzisch
    Thanks Bzisch, your response helped put me on the right path. After trying your solution I was able to scrape the URLs in order, but some of my results were still inaccurate, so I changed the crawling order from DFO to BFO and activated DUPEFILTER_DEBUG (since some results were repeating). Now it's working like a charm. – therealdeal Jun 17 '15 at 16:55
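
For anyone trying to reproduce that last step: the BFO switch and the duplicate logging mentioned in the comment are plain settings. A sketch of the relevant settings.py lines, assuming Scrapy 0.24 (later releases renamed the queue module to scrapy.squeues):

# settings.py (sketch) -- crawl breadth-first instead of the default depth-first
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'

# log every request dropped by the duplicate filter, to see what was repeating
DUPEFILTER_DEBUG = True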