
I am trying to scrape data from a page that uses a lot of AJAX calls and JavaScript execution to render the webpage, so I am trying to use scrapy with selenium to do this. The modus operandi is as follows:

  1. Add the login page URL to the scrapy start_urls list.

  2. Use the `FormRequest.from_response` method to post the username and password and get authenticated.

  3. Once logged in, request the desired page to be scraped.

  4. Pass this response to the Selenium WebDriver to click buttons on the page.

  5. Once the buttons are clicked and a new webpage is rendered, capture the result.

The code that I have thus far is as follows:

    from scrapy.spider import BaseSpider
    from scrapy.http import FormRequest, Request
    from selenium import webdriver
    import time


    class LoginSpider(BaseSpider):
        name = "sel_spid"
        start_urls = ["http://www.example.com/login.aspx"]

        def __init__(self):
            self.driver = webdriver.Firefox()

        def parse(self, response):
            return FormRequest.from_response(response,
                   formdata={'User': 'username', 'Pass': 'password'},
                   callback=self.check_login_response)

        def check_login_response(self, response):
            if "Log Out" in response.body:
                self.log("Successfully logged in")
                scrape_url = "http://www.example.com/authen_handler.aspx?SearchString=DWT+%3E%3d+500"
                yield Request(url=scrape_url, callback=self.parse_page)
            else:
                self.log("Bad credentials")

        def parse_page(self, response):
            self.driver.get(response.url)
            next = self.driver.find_element_by_class_name('dxWeb_pNext')
            next.click()
            time.sleep(2)
            # capture the html and store in a file

The two roadblocks I have hit so far are:

  1. Step 4 does not work. Whenever selenium opens the Firefox window, it is always at the login screen and does not know how to get past it.

  2. I don't know how to achieve step 5.

Any help will be greatly appreciated.

Amistad
  • Theoretically, you can pass the scrapy response cookies to the driver using the `add_cookie` method, see: http://stackoverflow.com/questions/16563073/how-to-pass-scrapy-login-cookies-to-selenium and http://stackoverflow.com/questions/19082248/python-selenium-rc-create-cookie. Though, why don't you log in using `selenium` as Eric suggested? Thanks. – alecxe Feb 10 '15 at 01:02
  • I could do that, but I don't want to lose out on the awesome Twisted code running under scrapy's hood. I plan to scrape a large number of URLs once I am authenticated and was hoping to do it in a non-blocking way. Is my thinking wrong? – Amistad Feb 10 '15 at 04:12

2 Answers


I don't believe you can switch between scrapy Requests and selenium like that. You need to log into the site using selenium, not yield Request(). The login session you created with scrapy is not transferred to the selenium session. Here is an example (the element ids/xpath will be different for you):

    scrape_url = "http://www.example.com/authen_handler.aspx"
    self.driver.get(scrape_url)
    time.sleep(2)
    username = self.driver.find_element_by_id("User")
    password = self.driver.find_element_by_name("Pass")
    username.send_keys("your_username")
    password.send_keys("your_password")
    self.driver.find_element_by_xpath("//input[@name='commit']").click()

then you can do:

    time.sleep(2)
    self.driver.find_element_by_class_name('dxWeb_pNext').click()
    time.sleep(2)

etc.
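For step 5, once the click has rendered the new content you can pull the page out of the driver and write it to disk. A minimal sketch (the fixed sleep and the output filename are assumptions; an explicit wait on a known element would be more robust):

    time.sleep(2)  # crude wait for the AJAX-rendered content to appear
    html = self.driver.page_source  # HTML of the DOM as currently rendered
    with open("rendered_page.html", "wb") as f:  # output filename is an assumption
        f.write(html.encode("utf-8"))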

EDIT: If you need to render JavaScript and are worried about speed/non-blocking, you can use http://splash.readthedocs.org/en/latest/index.html which should do the trick.

http://splash.readthedocs.org/en/latest/scripting-ref.html#splash-add-cookie has details on passing a cookie; you should be able to pass it from scrapy, but I have not done it before.
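If you go that route, the simplest integration is to point an ordinary scrapy Request at Splash's render.html HTTP endpoint, which returns the JavaScript-rendered HTML as the response body. A minimal sketch, assuming a Splash instance is listening locally on port 8050 and that this runs inside a spider callback (the 2-second wait is an assumption):

    import urllib
    from scrapy.http import Request

    # assumes a Splash instance is running locally on port 8050
    target = "http://www.example.com/authen_handler.aspx?SearchString=DWT+%3E%3d+500"
    splash_url = "http://localhost:8050/render.html?" + urllib.urlencode(
        {"url": target, "wait": 2})
    # the response body will be the JavaScript-rendered HTML of the target page
    yield Request(url=splash_url, callback=self.parse_page)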

Eric Valente

Log in with the scrapy API first:

    # call the scrapy post request, with browse_files as the callback
    return FormRequest.from_response(
        response,
        # formxpath=formxpath,
        formdata=formdata,
        callback=self.browse_files
    )

Pass the session to the selenium Chrome driver:

    # logged in previously with scrapy api
    # (needs: from selenium.webdriver.common.by import By)
    def browse_files(self, response):
        print "browse files for: %s" % (response.url)

        # response.headers
        cookie_list2 = response.headers.getlist('Set-Cookie')
        print cookie_list2

        self.driver.get(response.url)
        self.driver.delete_all_cookies()

        # extract all the cookies
        for cookie2 in cookie_list2:
            cookies = map(lambda e: e.strip(), cookie2.split(";"))

            for cookie in cookies:
                splitted = cookie.split("=")
                if len(splitted) == 2:
                    name = splitted[0]
                    value = splitted[1]
                    # for my particular use case I needed only these values
                    if name == 'csrftoken' or name == 'sessionid':
                        cookie_map = {"name": name, "value": value}
                    else:
                        continue
                elif len(splitted) == 1:
                    cookie_map = {"name": splitted[0], "value": ''}
                else:
                    continue

                print "adding cookie"
                print cookie_map
                self.driver.add_cookie(cookie_map)

        self.driver.get(response.url)

        # check if we have successfully logged in
        files = self.wait_for_elements_to_be_present(By.XPATH, "//*[@id='files']", response)
        print files
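`wait_for_elements_to_be_present` is not a Selenium built-in, so it must be a helper defined on the spider. A minimal sketch of what such a helper might look like, using Selenium's WebDriverWait (the 10-second timeout, and the unused `response` argument kept only to match the call above, are assumptions):

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    def wait_for_elements_to_be_present(self, by, locator, response, timeout=10):
        # block until at least one matching element is present in the DOM,
        # then return all matches; raises TimeoutException if none appear in time
        WebDriverWait(self.driver, timeout).until(
            EC.presence_of_element_located((by, locator)))
        return self.driver.find_elements(by, locator)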
cipri.l