
I'm trying to scrape a website that uses Ajax to load its different pages.
Although my Selenium browser is navigating through all the pages, the Scrapy response never changes, so it ends up scraping the same response (number-of-pages times).

Proposed solution:
I read in some answers that by using
hxs = HtmlXPathSelector(self.driver.page_source)
you can change the page source and then scrape it. But it is not working, and after adding this line the browser also stopped navigating.

Code:

def parse(self, response):
    self.driver.get(response.url)
    pages = int(response.xpath('//p[@class="pageingP"]/a/text()')[-2].extract())
    for i in range(pages):
        next = self.driver.find_element_by_xpath('//a[text()="Next"]')
        print response.xpath('//div[@id="searchResultDiv"]/h3/text()').extract()[0]
        try:
            next.click()
            time.sleep(3)
            #hxs = HtmlXPathSelector(self.driver.page_source)
            for sel in response.xpath("//tr/td/a"):
                item = WarnerbrosItem()
                item['url'] = response.urljoin(sel.xpath('@href').extract()[0])
                request = scrapy.Request(item['url'], callback=self.parse_job_contents, meta={'item': item}, dont_filter=True)
                yield request
        except:
            break
    self.driver.close()

Please help.

– akash
  • [This question](https://stackoverflow.com/questions/31174330/passing-selenium-response-url-to-scrapy) and [this one](https://stackoverflow.com/questions/19327406/how-to-set-different-scrapy-settings-for-different-spiders) helped me a lot – Henadzi Rabkin Dec 22 '19 at 22:28

2 Answers


When using selenium and scrapy together, after having selenium perform the click I've read the page back for scrapy using

resp = TextResponse(url=self.driver.current_url, body=self.driver.page_source, encoding='utf-8')

That would go where your HtmlXPathSelector line went. All the scrapy code from that point to the end of the routine would then need to refer to resp (the page rendered after the click) rather than response (the page rendered before the click).
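As a rough sketch of how that slots into the question's parse() (assuming the same WarnerbrosItem and parse_job_contents from the question, and that scrapy and time are already imported):

from scrapy.http import TextResponse

def parse(self, response):
    self.driver.get(response.url)
    pages = int(response.xpath('//p[@class="pageingP"]/a/text()')[-2].extract())
    for i in range(pages):
        next = self.driver.find_element_by_xpath('//a[text()="Next"]')
        next.click()
        time.sleep(3)
        # rebuild a scrapy response from whatever selenium is looking at now
        resp = TextResponse(url=self.driver.current_url, body=self.driver.page_source, encoding='utf-8')
        # query resp (after the click), not response (before it)
        for sel in resp.xpath('//tr/td/a'):
            item = WarnerbrosItem()
            item['url'] = resp.urljoin(sel.xpath('@href').extract()[0])
            yield scrapy.Request(item['url'], callback=self.parse_job_contents, meta={'item': item}, dont_filter=True)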

The time.sleep(3) may give you issues as it doesn't guarantee the page has actually loaded; it's just an unconditional wait. It might be better to use something like

WebDriverWait(self.driver, 30).until(test page has changed)

which waits until the page you are waiting for passes a specific test, such as finding the expected page number or manufacturer's part number.
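For example, one such test is waiting for an element from the old page to go stale (here next is the link element from the question's loop; the XPath is the heading the question already prints, and the 30-second timeout is an arbitrary choice):

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# grab a handle on something from the current page...
old_heading = self.driver.find_element_by_xpath('//div[@id="searchResultDiv"]/h3')
next.click()
# ...then wait until it goes stale, i.e. the old DOM has been replaced
WebDriverWait(self.driver, 30).until(EC.staleness_of(old_heading))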

I'm not sure what the impact of closing the driver at the end of every pass through parse() is. I've used the following snippet in my spider to close the driver when the spider is closed.

from selenium import webdriver
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher

def __init__(self, filename=None):
    # wire us up to selenium
    self.driver = webdriver.Firefox()
    # close the driver once, when the spider shuts down
    dispatcher.connect(self.spider_closed, signals.spider_closed)

def spider_closed(self, spider):
    self.driver.close()
– Steve

Selenium isn't connected to Scrapy in any way, and neither is its response object; in the code you posted I don't see you ever changing the response object.

You'll have to work with them independently, for example by rebuilding a selector from the driver's page source on every pass, as sketched below.
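A minimal sketch of that (assuming the driver has already navigated, and reusing the question's parse_job_contents callback):

from scrapy.selector import Selector

# build a fresh selector from selenium's current page source
sel = Selector(text=self.driver.page_source)
for href in sel.xpath('//tr/td/a/@href').extract():
    yield scrapy.Request(response.urljoin(href), callback=self.parse_job_contents, dont_filter=True)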

– eLRuLL