
I’m struggling with getting scrapy (with or without selenium) to extract dynamically generated content from a web page. The site lists performance for different universities, and allows you to select each Study Area offered by that uni. As an example, from the page listed in the code below, I’d like to be able to extract university name (“Bond University”) and the value for ‘Overall quality of experience’ (91.3%).

However, when I use ‘view source’, curl or scrapy, the actual values aren’t shown. E.g. where I’d expect to see Uni name, it shows:

<h1 class="inline-block instiution-name" data-bind="text: Description"></h1>

But if I use firebug or chrome to inspect element, it shows the

<h1 class="inline-block instiution-name" data-bind="text: Description">Bond University</h1>
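The difference between the two snippets can be checked programmatically. A stdlib-only sketch (both strings copied from above; the element is valid XML, so `ElementTree` can parse it):

```python
import xml.etree.ElementTree as ET

# Markup as served by the server (empty element, to be filled in client-side)
raw = '<h1 class="inline-block instiution-name" data-bind="text: Description"></h1>'
# Markup as seen in the browser inspector after the data-bind has run
rendered = '<h1 class="inline-block instiution-name" data-bind="text: Description">Bond University</h1>'

print(ET.fromstring(raw).text)       # None: the server ships no text node at all
print(ET.fromstring(rendered).text)  # Bond University: inserted by the data-bind
```

This confirms the name is never in the HTML that curl or scrapy downloads; it is injected in the browser by the Knockout-style `data-bind` after the AJAX call returns.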

On further inspection, on the ‘Net’ tab in firebug, I can see that there’s an AJAX (?) call being made that returns the relevant information, but I haven’t been able to mimic this in scrapy or even curl (yes, I did search and spent an embarrassingly long time trying, I’m afraid).

Request headers

POST /Websilk/DataServices/SurveyData.asmx/FetchInstitutionStudyAreaData HTTP/1.1
Host: www.qilt.edu.au
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:39.0) Gecko/20100101 Firefox/39.0
Accept: application/json, text/javascript, */*; q=0.01
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Content-Type: application/json; charset=utf-8
X-Requested-With: XMLHttpRequest
Referer: http://www.qilt.edu.au/institutions/institution/bond-university/business-management
Content-Length: 36
Cookie: _ga=GA1.3.69062787.1442441726; ASP.NET_SessionId=lueff4ysg3yvd2csv5ixsc1f; _gat=1
Connection: keep-alive
Pragma: no-cache
Cache-Control: no-cache

POST Parameters passed with the request

{"InstitutionId":20,"StudyAreaId":0}

As a second option, I tried using Selenium with scrapy, since I thought it might ‘see’ the real values, like the browser does, but to no avail. My main attempt thus far is below:

import scrapy
import time  #used for the sleep() function

from selenium import webdriver

class QiltSpider(scrapy.Spider):
    name = "qilt"

    allowed_domains = ["qilt.edu.au"]
    start_urls = [
        "http://www.qilt.edu.au/institutions/institution/rmit-university/architecture-building/"
    ]

    def __init__(self):
        self.driver = webdriver.Firefox()
        self.driver.get('http://www.qilt.edu.au/institutions/institution/rmit-university/architecture-building/')
        time.sleep(5) # tried pausing, in case problem was delayed loading - didn't work

    def parse(self, response):
        # parse the response to find the uni name and print it to the console (XPath taken from Firebug). This finds the relevant section, but it comes back empty
        title = response.xpath('//*[@id="bd"]/div[2]/div/div/div[1]/div/div[2]/h1').extract()
        print(title)
        # dump the whole response to a file so I can check whether dynamic values were captured
        with open("extract.html", 'wb') as f:
            f.write(response.body)
        self.driver.close()

Can anyone tell me how I can achieve this?

Many thanks!

EDIT: Thanks for the suggestions so far, but any thoughts on how to specifically mimic the AJAX call with parameters of InstitutionID and StudyAreaID? My code to test this was as below, but it seems to still hit an error page.

import scrapy
from scrapy.http import FormRequest

class HeaderTestSpider(scrapy.Spider):
    name = "headerTest"

    allowed_domains = ["qilt.edu.au"]
    start_urls = [
        "http://www.qilt.edu.au/institutions/institution/rmit-university/architecture-building/"
    ]

    def parse(self, response):
        return [FormRequest(url="http://www.qilt.edu.au/Websilk/DataServices/SurveyData.asmx/FetchInstitutionData",
                            method='POST',  
                            formdata={'InstitutionId':'20', 'StudyAreaId': '0'},
                            callback=self.parser2)]

  • You can use [Requests](http://www.python-requests.org/en/latest/) library and mimic the *AJAX* call being made. – Vikas Ojha Sep 17 '15 at 12:20
  • No need for `requests` here since Scrapy is in use. – alecxe Sep 17 '15 at 13:44
  • Why not just use Selenium and scrape the data off the page once it's rendered in the browser? – JeffC Sep 17 '15 at 14:07
  • solution here: http://stackoverflow.com/a/24373576/2368836 You might have to add an implicit wait after the driver.get – rocktheartsm4l Sep 17 '15 at 16:25
  • Thanks for the replies. I did read the other forum previously and tried to replicate the method used by Badarau Petru. In regard to your suggestion about middleware, other than just enabling it in settings.py, I wasn't sure how to mimic the AJAX call particularly with the parameters for InstitutionID and StudyAreaID. Unfortunately, I don't get to play with python regularly, so it may be beyond me. I've actually created an elance job to see if I can learn from the code they come up with. – Tango delta Sep 18 '15 at 22:28

1 Answer


The QILT page uses AJAX to retrieve the data from the server. The AJAX request is sent by JavaScript code that fires on the document.ready (jQuery) / window.onload (plain JavaScript) event (if you are not familiar with JavaScript, this event fires as soon as the web page finishes loading in the browser window). Since you are using software to simulate the page requests, this event is never fired.

For the AJAX request you are trying to simulate, the request body is of type application/json, so add the following header to the request: Content-Type: application/json. Note that Scrapy's FormRequest encodes its formdata as application/x-www-form-urlencoded, which is why that attempt hit an error page; the body must be sent as raw JSON instead.
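Putting this together, a minimal sketch: the endpoint URL, parameter names, and headers are taken from the captured request in the question; the helper name is hypothetical.

```python
import json

# Endpoint captured in the browser's Net tab (see the request headers above)
URL = ("http://www.qilt.edu.au/Websilk/DataServices/"
       "SurveyData.asmx/FetchInstitutionStudyAreaData")

def build_ajax_request(institution_id, study_area_id):
    """Build the raw JSON body and headers the browser sends."""
    body = json.dumps({"InstitutionId": institution_id,
                       "StudyAreaId": study_area_id})
    headers = {
        "Content-Type": "application/json; charset=utf-8",
        "X-Requested-With": "XMLHttpRequest",
    }
    return body, headers

body, headers = build_ajax_request(20, 0)
# In a Scrapy spider, pass these to a plain Request rather than FormRequest,
# so the body goes out as JSON instead of form-urlencoded:
#   yield scrapy.Request(URL, method="POST", body=body, headers=headers,
#                        callback=self.parse_data)
```

The response should then be the same JSON payload the page itself renders from, so no Selenium is needed for this data.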

tech poy