
I'm trying to do the following:

  1. log into a web page (in my case zendesk.com)
  2. use that session to do some POST requests

In fact, Zendesk is missing some APIs (create/alter macros), which I now need to simulate by driving a browser session.

So I'm not writing a spider, but trying to interact with the website as my script proceeds. The POST requests are not known from the start, only while the script runs.

The Scrapy docs give the following example to illustrate how to use an authenticated session:

from scrapy import log
from scrapy.http import FormRequest
from scrapy.spider import BaseSpider


class LoginSpider(BaseSpider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        # submit the login form pre-filled from the page
        return [FormRequest.from_response(response,
                    formdata={'username': 'john', 'password': 'secret'},
                    callback=self.after_login)]

    def after_login(self, response):
        # check that login succeeded before going on
        if "authentication failed" in response.body:
            self.log("Login failed", level=log.ERROR)
            return

        # continue scraping with authenticated session...

But it looks like this only works for scraping; in my case I just want to "hold" the session and keep working with it. Is there a way to achieve this with Scrapy, or are there tools that fit this task better?

  • I don't think scrapy is the right tool for you. scrapy is for scraping, it doesn't make sense to log in and "hold" it. Try to log in using urllib: http://stackoverflow.com/q/189555/248296 – warvariuc Jul 12 '12 at 09:28

1 Answer


Thanks a lot @warvariuc. Based on the Stack Overflow post you linked, this is the solution I came up with:

import urllib, urllib2, cookielib, re

zendesk_subdomain = 'mysub'
zendesk_username = '...'
zendesk_password = '...'

# build an opener that keeps cookies, so the login session is preserved
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# fetch the login page and pull the CSRF token out of the form
resp = opener.open('http://%s.zendesk.com/access/unauthenticated' % zendesk_subdomain)
s = resp.read()

data = dict()
data['authenticity_token'] = re.findall('<input name="authenticity_token" type="hidden" value="([^"]+)"', s)[0]
data['return_to'] = 'http://%s.zendesk.com/login' % zendesk_subdomain
data['user[email]'] = zendesk_username
data['user[password]'] = zendesk_password
data['commit'] = 'Log in'
data['remember_me'] = '1'

# POST the login form; the cookie jar now holds the authenticated session
opener.open('https://%s.zendesk.com/access/login' % zendesk_subdomain, urllib.urlencode(data))

From there, all pages can be accessed through opener, e.g.

opener.open('http://%s.zendesk.com/rules/new?filter=macro' % zendesk_subdomain)
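
And since the whole point was to send POST requests from that session, here is a minimal sketch of how that looks with the same opener; the /rules path and the macro[title] field below are made-up placeholders, not a documented Zendesk route:

# continue the script above: POST form data through the authenticated opener
# (URL path and field name are hypothetical placeholders)
post_data = urllib.urlencode({'macro[title]': 'My macro'})
resp = opener.open('http://%s.zendesk.com/rules' % zendesk_subdomain, post_data)
print resp.getcode()  # 200 if the request went through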
    mechanize can simplify your life even more: http://wwwsearch.sourceforge.net/mechanize – Paulo Freitas Jul 16 '12 at 05:44
  • I finally wrapped it all into a class and added method handling (especially needing PUT for adding zendesk macros): http://pastie.org/private/zxc7samxczqbiw29v91fg – hansaplast Jul 16 '12 at 06:39
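
Regarding the method handling mentioned in the last comment: urllib2 only sends GET and POST by default, so a PUT needs a small Request subclass overriding get_method. A minimal sketch of that trick (not necessarily what the linked class does; URL and field are hypothetical placeholders):

import urllib, urllib2

class PutRequest(urllib2.Request):
    # report PUT instead of urllib2's default GET/POST
    def get_method(self):
        return 'PUT'

put_data = urllib.urlencode({'macro[title]': 'Renamed macro'})
resp = opener.open(PutRequest('http://%s.zendesk.com/rules/12345' % zendesk_subdomain, data=put_data))
print resp.getcode()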