I’m struggling to get Scrapy (with or without Selenium) to extract dynamically generated content from a web page. The site lists performance figures for different universities and lets you select each Study Area offered by that university. As an example, from the page listed in the code below, I’d like to extract the university name (“Bond University”) and the value for ‘Overall quality of experience’ (91.3%).
However, when I use ‘view source’, curl or Scrapy, the actual values aren’t shown. For example, where I’d expect to see the university name, the source shows:
<h1 class="inline-block instiution-name" data-bind="text: Description"></h1>
But if I inspect the element in Firebug or Chrome, it shows the populated value:
<h1 class="inline-block instiution-name" data-bind="text: Description">Bond University</h1>
On further inspection, on the ‘Net’ tab in firebug, I can see that there’s an AJAX (?) call being made that returns the relevant information, but I haven’t been able to mimic this in scrapy or even curl (yes, I did search and spend an embarrassingly long time trying I’m afraid).
Request headers
POST /Websilk/DataServices/SurveyData.asmx/FetchInstitutionStudyAreaData HTTP/1.1
Host: www.qilt.edu.au
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:39.0) Gecko/20100101 Firefox/39.0
Accept: application/json, text/javascript, */*; q=0.01
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Content-Type: application/json; charset=utf-8
X-Requested-With: XMLHttpRequest
Referer: http://www.qilt.edu.au/institutions/institution/bond-university/business-management
Content-Length: 36
Cookie: _ga=GA1.3.69062787.1442441726; ASP.NET_SessionId=lueff4ysg3yvd2csv5ixsc1f; _gat=1
Connection: keep-alive
Pragma: no-cache
Cache-Control: no-cache
POST Parameters passed with the request
{"InstitutionId":20,"StudyAreaId":0}
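For reference, the captured call can be rebuilt as a request object. This is only a minimal sketch using the Python 3 standard library rather than Scrapy: the URL, headers and payload are copied from the capture above, while the function name `build_ajax_request` is made up for illustration. The cookie and user-agent headers are left out, and it is an untested assumption that the server accepts the request without them.

```python
import json
import urllib.request

# Endpoint and Referer copied from the captured request headers.
AJAX_URL = ("http://www.qilt.edu.au/Websilk/DataServices/"
            "SurveyData.asmx/FetchInstitutionStudyAreaData")
REFERER = ("http://www.qilt.edu.au/institutions/institution/"
           "bond-university/business-management")


def build_ajax_request(institution_id, study_area_id):
    """Build (but do not send) a POST that mirrors the captured AJAX call.

    Hypothetical helper: the name and signature are illustrative only.
    """
    payload = {"InstitutionId": institution_id, "StudyAreaId": study_area_id}
    # Compact separators give the same 36-byte body as the capture
    # (Content-Length: 36).
    body = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    headers = {
        "Content-Type": "application/json; charset=utf-8",
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "X-Requested-With": "XMLHttpRequest",
        "Referer": REFERER,
    }
    return urllib.request.Request(AJAX_URL, data=body, headers=headers,
                                  method="POST")


# Nothing is sent here; the object only describes the request.
request = build_ajax_request(20, 0)
```

Passing the returned object to `urllib.request.urlopen` would actually send it. Whether the server also needs the `ASP.NET_SessionId` cookie seen in the capture cannot be told from the headers alone.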
As a second option, I tried using Selenium with scrapy, since I thought it might ‘see’ the real values, like the browser does, but to no avail. My main attempt thus far is below:
import scrapy
import time  # used for the sleep() function
from selenium import webdriver


class QiltSpider(scrapy.Spider):
    name = "qilt"
    allowed_domains = ["qilt.edu.au"]
    start_urls = [
        "http://www.qilt.edu.au/institutions/institution/rmit-university/architecture-building/"
    ]

    def __init__(self):
        self.driver = webdriver.Firefox()
        self.driver.get('http://www.qilt.edu.au/institutions/institution/rmit-university/architecture-building/')
        time.sleep(5)  # tried pausing, in case problem was delayed loading - didn't work

    def parse(self, response):
        # parse the response to find the uni name and show in console (using xpath code from firebug).
        # This finds the relevant section, but it shows as empty
        title = response.xpath('//*[@id="bd"]/div[2]/div/div/div[1]/div/div[2]/h1').extract()
        print(title)
        # dumping the whole response to a file so I can check whether dynamic values were captured
        with open("extract.html", 'wb') as f:
            f.write(response.body)
        self.driver.close()
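One thing to note about the spider above: `parse()` reads Scrapy's own `response`, which is the unrendered source, not the DOM that the Selenium driver loaded. Selenium exposes the rendered markup as `driver.page_source`, so the extraction would need to run on that string instead. The sketch below shows the difference on the two `<h1>` snippets quoted earlier. It uses the standard-library `html.parser` (Python 3) instead of Scrapy selectors so that it runs without a browser, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class DescriptionH1Parser(HTMLParser):
    """Collects the text of <h1> elements bound to 'text: Description'."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and dict(attrs).get("data-bind") == "text: Description":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.text += data


def extract_institution_name(html):
    """Return the bound <h1> text, or '' if the element is still empty."""
    parser = DescriptionH1Parser()
    parser.feed(html)
    parser.close()
    return parser.text.strip()


# What 'view source', curl and Scrapy's response contain.
RAW_SOURCE = '<h1 class="inline-block instiution-name" data-bind="text: Description"></h1>'
# What the browser (and so driver.page_source) holds after the script has run.
RENDERED_DOM = '<h1 class="inline-block instiution-name" data-bind="text: Description">Bond University</h1>'

raw_name = extract_institution_name(RAW_SOURCE)
rendered_name = extract_institution_name(RENDERED_DOM)
```

Run against the raw source the function finds the element but no text, which matches the empty result printed by the spider. Run against the rendered markup it returns the name.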
Can anyone tell me how I can achieve this?
Many thanks!
EDIT: Thanks for the suggestions so far, but any thoughts on how to specifically mimic the AJAX call with the InstitutionId and StudyAreaId parameters? My test code is below, but it still seems to hit an error page.
import scrapy
from scrapy.http import FormRequest


class HeaderTestSpider(scrapy.Spider):
    name = "headerTest"
    allowed_domains = ["qilt.edu.au"]
    start_urls = [
        "http://www.qilt.edu.au/institutions/institution/rmit-university/architecture-building/"
    ]

    def parse(self, response):
        return [FormRequest(url="http://www.qilt.edu.au/Websilk/DataServices/SurveyData.asmx/FetchInstitutionData",
                            method='POST',
                            formdata={'InstitutionId': '20', 'StudyAreaId': '0'},
                            callback=self.parser2)]
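A possible cause of the error page, offered as a guess rather than a confirmed diagnosis: `FormRequest` with `formdata` sends a URL-encoded form body, whereas the captured browser request declares `Content-Type: application/json` and carries a JSON body. The spider also posts to `FetchInstitutionData`, while the capture shows `FetchInstitutionStudyAreaData`. Nothing in the post establishes whether both endpoints exist. The sketch below compares the two body encodings using only the Python 3 standard library, with hypothetical helper names. In Scrapy itself, the JSON string could be passed as the `body` of a plain `scrapy.Request` with `method='POST'` and a matching `Content-Type` header.

```python
import json
from urllib.parse import urlencode


def form_encoded_body(params):
    """Body in the URL-encoded style that a form POST sends."""
    return urlencode(params)


def json_body(params):
    """Compact JSON body, in the style of the captured AJAX payload."""
    return json.dumps(params, separators=(",", ":"))


# Same parameters as the spider above (strings, as formdata requires).
form_body = form_encoded_body({"InstitutionId": "20", "StudyAreaId": "0"})
# Same parameters as the captured request (integers inside JSON).
ajax_body = json_body({"InstitutionId": 20, "StudyAreaId": 0})
```

A server that expects JSON would have to parse the first form as something it is not, which is one plausible route to an error response.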