I just started using Scrapy and Selenium, and I'm having trouble scraping a webpage that has infinite scrolling:
http://observador.pt/opiniao/autor/ahcristo
So, I want to extract the links for each entry (political texts). With Scrapy alone this is not possible, because you need to scroll down for all the entries to show up. I'm using Selenium to drive a Chrome browser and simulate the scrolling. My problem is that the scrolling is not working. I based the code on other similar examples like this or this. The code counts the total number of entry links after each scroll; if it were working, that count should increase after each step. Instead it gives me a constant number of 24 links.
# -*- coding: utf-8 -*-
import scrapy
from selenium import webdriver
import time

from observador.items import ObservadorItem


class OpinionSpider(scrapy.Spider):
    name = "opinionspider"
    start_urls = ["http://observador.pt/opiniao/"]

    def __init__(self):
        self.driver = webdriver.Chrome()

    def parse(self, response):
        # Columnists
        for url in response.xpath('//*[@id="main"]/div/div[1]/ul/li[1]/div/ul/li[*]/a/@href').extract():
            # test for a single author
            if url == 'http://observador.pt/opiniao/autor/ahcristo':
                yield scrapy.Request(url, callback=self.parse_author_main_page)
            else:
                continue

    def parse_author_main_page(self, response):
        self.driver.get(response.url)
        count = 0
        for url in response.xpath('//*[@id="main"]/div/div[3]/div[1]/article[*]/h1/a/@href').extract():
            count += 1
        print "Number of links: ", count
        for i in range(1, 100):
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(4)
            count = 0
            for url in response.xpath('//*[@id="main"]/div/div[3]/div[1]/article[*]/h1/a/@href').extract():
                count += 1
            print "Number of links: ", count
        self.driver.close()