I am trying to scrape data from a page which uses a lot of AJAX calls and JavaScript execution to render the webpage, so I am trying to use Scrapy together with Selenium to do this. The modus operandi is as follows:
1. Add the login page URL to the Scrapy start_urls list.
2. Use FormRequest.from_response to post the username and password and get authenticated.
3. Once logged in, request the desired page to be scraped.
4. Pass this response to the Selenium webdriver to click buttons on the page.
5. Once the buttons are clicked and a new webpage is rendered, capture the result.
The code that I have thus far is as follows:
from scrapy.spider import BaseSpider
from scrapy.http import FormRequest, Request
from selenium import webdriver
import time

class LoginSpider(BaseSpider):
    name = "sel_spid"
    start_urls = ["http://www.example.com/login.aspx"]

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        # Step 2: submit the login form through Scrapy
        return FormRequest.from_response(response,
                    formdata={'User': 'username', 'Pass': 'password'},
                    callback=self.check_login_response)

    def check_login_response(self, response):
        if "Log Out" in response.body:
            self.log("Successfully logged in")
            # Step 3: request the page to be scraped
            scrape_url = "http://www.example.com/authen_handler.aspx?SearchString=DWT+%3E%3d+500"
            yield Request(url=scrape_url, callback=self.parse_page)
        else:
            self.log("Bad credentials")

    def parse_page(self, response):
        # Step 4: hand the page over to Selenium to click buttons
        self.driver.get(response.url)
        next = self.driver.find_element_by_class_name('dxWeb_pNext')
        next.click()
        time.sleep(2)
        # capture the html and store in a file
The two roadblocks I have hit so far are:
Step 4 does not work. Whenever Selenium opens the Firefox window, it is always at the login screen, and I don't know how to get past it.
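My guess is that the Firefox instance Selenium launches starts with an empty cookie jar, so the session Scrapy established during login never reaches it. Below is a sketch of what I am thinking of trying; the Set-Cookie parsing is simplified (it ignores attributes like Path and HttpOnly), and it assumes the relevant Set-Cookie headers are still present on this response rather than only on the login response, which I have not verified:

    def parse_page(self, response):
        # Selenium only accepts cookies for the domain it is
        # currently on, so visit the site once before adding them.
        self.driver.get("http://www.example.com")
        # Copy the cookies Scrapy received into the Selenium session.
        for header in response.headers.getlist('Set-Cookie'):
            name, _, rest = header.partition('=')
            self.driver.add_cookie({'name': name,
                                    'value': rest.split(';', 1)[0]})
        # Now the authenticated page should render past the login screen.
        self.driver.get(response.url)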
I don't know how to achieve step 5.
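For step 5, I believe the rendered HTML can be read back through the driver's page_source property once the click has finished. Something like the sketch below is what I have in mind, where the wait condition (comparing page_source before and after the click) is just a heuristic I am assuming will detect the re-render:

    from selenium.webdriver.support.ui import WebDriverWait

    def parse_page(self, response):
        self.driver.get(response.url)
        next_btn = self.driver.find_element_by_class_name('dxWeb_pNext')
        old_html = self.driver.page_source
        next_btn.click()
        # Wait until the DOM actually changes instead of sleeping
        # a fixed two seconds.
        WebDriverWait(self.driver, 10).until(
            lambda d: d.page_source != old_html)
        # Capture the rendered html and store it in a file.
        with open('rendered_page.html', 'wb') as f:
            f.write(self.driver.page_source.encode('utf-8'))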
Any help will be greatly appreciated.