
I just started using Scrapy and Selenium, and I'm having some problems scraping a webpage that has infinite scrolling:

http://observador.pt/opiniao/autor/ahcristo

So, I want to extract the links for each entry (political texts). With Scrapy alone this is not possible, because you need to scroll down for all the entries to show up. I'm using Selenium to simulate a Chrome browser and scroll down. My problem is that the scrolling is not working. I based the code on other similar examples like this or this. The code counts the total number of entry links after each scroll; if it were working, the count should increase after each step. Instead it gives me a constant number of 24 links.

# -*- coding: utf-8 -*-

import scrapy
from selenium import webdriver
import time

from observador.items import ObservadorItem

class OpinionSpider(scrapy.Spider):
    name = "opinionspider"
    start_urls = ["http://observador.pt/opiniao/"]

    def __init__(self):
        self.driver = webdriver.Chrome()

    def parse(self, response):
        # Colunistas (columnists)
        for url in response.xpath('//*[@id="main"]/div/div[1]/ul/li[1]/div/ul/li[*]/a/@href').extract():
            # test for a single author
            if url == 'http://observador.pt/opiniao/autor/ahcristo':            
                yield scrapy.Request(url,callback=self.parse_author_main_page)
            else:
                continue

    def parse_author_main_page(self,response):
        self.driver.get(response.url)

        count = 0
        for url in response.xpath('//*[@id="main"]/div/div[3]/div[1]/article[*]/h1/a/@href').extract():
            count += 1
        print "Number of links: ",count

        for i in range(1,100):
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(4)

            count = 0
            for url in response.xpath('//*[@id="main"]/div/div[3]/div[1]/article[*]/h1/a/@href').extract():
                count += 1
        print "Number of links: ",count

        self.driver.close()
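(For what it's worth, the loop above re-runs the XPath against the original Scrapy response, which never changes no matter how far Selenium scrolls. A minimal sketch, re-using the same XPath, of counting from the live browser DOM instead:)

    from scrapy.selector import Selector

    # Re-parse the browser's current DOM after each scroll, instead of the
    # static `response` Scrapy downloaded before any scrolling happened.
    sel = Selector(text=self.driver.page_source)
    count = len(sel.xpath('//*[@id="main"]/div/div[3]/div[1]/article[*]/h1/a/@href').extract())
    print "Number of links: ", count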

1 Answer


Your way of solving this with Selenium may be a bit overkill.

If you look at how the webpage you want to scrape works, it simply loads the articles with an AJAX request (it POSTs to the /wp-admin/admin-ajax.php page).

Simply try to replicate in your spider what the JavaScript code that loads the articles does. It will be much faster and easier.

Here is a working cURL query to retrieve some articles

    curl 'http://observador.pt/wp-admin/admin-ajax.php' \
      -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' \
      --data 'action=obs_get_latest_articles&offset=2&section=author&scroll_type=usual&data_id=74&data_type=1&exclude=&nonce=5145441fea'
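
A minimal sketch of replicating that request with Scrapy alone. The form fields are copied from the cURL query above; the `parse_articles` callback and its XPath are assumptions about what the endpoint returns, and `data_id`/`nonce` presumably differ per author and per session:

    import scrapy

    class AuthorArticlesSpider(scrapy.Spider):
        name = "authorarticles"

        def start_requests(self):
            # POST the same form data the infinite-scroll JavaScript sends.
            yield scrapy.FormRequest(
                "http://observador.pt/wp-admin/admin-ajax.php",
                formdata={
                    "action": "obs_get_latest_articles",
                    "offset": "2",
                    "section": "author",
                    "scroll_type": "usual",
                    "data_id": "74",        # presumably identifies the author
                    "data_type": "1",
                    "exclude": "",
                    "nonce": "5145441fea",  # taken from the page; presumably expires
                },
                callback=self.parse_articles,
            )

        def parse_articles(self, response):
            # Assuming the endpoint returns an HTML fragment with the usual
            # <article> markup, pull out the entry links.
            for url in response.xpath('//article/h1/a/@href').extract():
                yield {"url": url}

Paging would then presumably just be a matter of incrementing `offset` and issuing the request again until the endpoint stops returning articles.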
  • Ok, thanks for the answer. How did you find out? I confess I don't fully understand your answer; I was showing you an example for the author "http://observador.pt/opiniao/autor/ahcristo". I need to apply the same scraping to all the different authors. – Miguel Apr 27 '16 at 16:47