I'm trying to scrape a website that uses Ajax to load its different pages.
Although my Selenium browser navigates through all the pages, the Scrapy response stays the same, so I end up scraping the same response over and over (once per page).
Proposed Solution:
I read in some answers that by using
hxs = HtmlXPathSelector(self.driver.page_source)
you can switch to the updated page source and then scrape it. But it is not working for me; after adding that line, the browser also stopped navigating.
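Just to be clear about what I mean, this is roughly how I understood that suggestion: build a selector from whatever HTML the Selenium driver currently holds, instead of querying the response Scrapy downloaded. A minimal sketch (I'm assuming the newer Selector(text=...) API is the equivalent of HtmlXPathSelector here, and that self.driver is the webdriver my spider creates):

from scrapy.selector import Selector

# parse the HTML of the page Selenium is currently showing,
# rather than the response object Scrapy fetched originally
page = Selector(text=self.driver.page_source)
titles = page.xpath('//div[@id="searchResultDiv"]/h3/text()').extract()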
Code:
def parse(self, response):
    self.driver.get(response.url)
    # total number of result pages, read from the pagination links
    pages = int(response.xpath('//p[@class="pageingP"]/a/text()')[-2].extract())
    for i in range(pages):
        next = self.driver.find_element_by_xpath('//a[text()="Next"]')
        print response.xpath('//div[@id="searchResultDiv"]/h3/text()').extract()[0]
        try:
            next.click()
            time.sleep(3)
            #hxs = HtmlXPathSelector(self.driver.page_source)
            # this still queries the response Scrapy downloaded, not the page Selenium is on
            for sel in response.xpath("//tr/td/a"):
                item = WarnerbrosItem()
                item['url'] = response.urljoin(sel.xpath('@href').extract()[0])
                request = scrapy.Request(item['url'], callback=self.parse_job_contents,
                                         meta={'item': item}, dont_filter=True)
                yield request
        except:
            break
    self.driver.close()
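For what it's worth, this is where I understood the fresh selector should go, replacing the response.xpath calls inside the loop. Again just a sketch under the same Selector(text=...) assumption; WarnerbrosItem, parse_job_contents and self.driver come from my spider above:

# inside the try block, after the click (Selector imported at module level)
next.click()
time.sleep(3)
hxs = Selector(text=self.driver.page_source)  # re-parse the page after the click
for sel in hxs.xpath("//tr/td/a"):
    item = WarnerbrosItem()
    item['url'] = response.urljoin(sel.xpath('@href').extract()[0])
    yield scrapy.Request(item['url'], callback=self.parse_job_contents,
                         meta={'item': item}, dont_filter=True)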
Please help.