I have an ASPX page at https://searchlight.cluen.com/E5/CandidateSearch.aspx with a form that I'd like to submit so I can parse the results for information.

Using Python's urllib and urllib2, I created a POST request with the proper headers and user agent, but the resulting HTML response does not contain the expected table of results. Am I misunderstanding something, or am I missing an obvious detail?

    import urllib
    import urllib2

    headers = {
        'HTTP_USER_AGENT': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.13) Gecko/2009073022 Firefox/3.0.13',
        'HTTP_ACCEPT': 'text/html,application/xhtml+xml,application/xml; q=0.9,*/*; q=0.8',
        'Content-Type': 'application/x-www-form-urlencoded'
    }
    # obtained these values from viewing the source of https://searchlight.cluen.com/E5/CandidateSearch.aspx
    viewstate = '/wEPDwULLTE3NTc4MzQwNDIPZBYCAg ... uJRWDs/6Ks1FECco='
    eventvalidation = '/wEWjQMC8pat6g4C77jgxg0CzoqI8wgC3uWinQQCwr/ ... oPKYVeb74='
    url = 'https://searchlight.cluen.com/E5/CandidateSearch.aspx'
    formData = (
        ('__VIEWSTATE', viewstate),
        ('__EVENTVALIDATION', eventvalidation),
        ('__EVENTTARGET',''),
        ('__EVENTARGUMENT',''),
        ('textcity',''),
        ('dropdownlistposition',''),
        ('dropdownlistdepartment',''),
        ('dropdownlistorderby',''),
        ('textsearch',''),
    )

    # change user agent
    from urllib import FancyURLopener
    class MyOpener(FancyURLopener):
        version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'

    myopener = MyOpener()

    # encode form data in post-request format
    encodedFields = urllib.urlencode(formData)

    f = myopener.open(url, encodedFields)
    print f.info()

    # save the response body; only write if the file actually opened
    try:
        fout = open('tmp.htm', 'w')
    except IOError:
        print('Could not open output file\n')
    else:
        fout.writelines(f.readlines())
        fout.close()

Several existing questions on this topic were helpful (such as "how to submit query to .aspx page in python"), but I'm still stuck and would appreciate additional help.

The resulting HTML page says I may need to log in, but the ASPX page displays in my browser without any login.

Here are the results from info():

    Connection: close
    Date: Tue, 07 Jun 2011 17:05:26 GMT
    Server: Microsoft-IIS/6.0
    X-Powered-By: ASP.NET
    X-AspNet-Version: 2.0.50727
    Cache-Control: private
    Content-Type: text/html; charset=utf-8
    Content-Length: 1944

user773328
  • Upon a quick glance, I don't notice anything wrong with your code. I tried visiting the website in my browser (Firefox 4.0) and received the following message: `An error has occurred in processing your request. Please try again (you may need to log back in). ...` – Gregg Jun 07 '11 at 17:24
  • Could the target aspx page be looking for something in the session and tripping up because it didn't have the aspnet session cookie in the request you perform your post in? – ashelvey Jun 07 '11 at 17:31
  • Thanks for your answers. I can visit the site in my browser because I append the login info, which I did not include here. Yes, it seems like a session issue between asp.net and my simulated browser. – user773328 Jun 08 '11 at 13:59

2 Answers

ASP.NET protects the ViewState against tampering by embedding validation data (a message authentication code) in it.

More than likely, the server is rejecting your request because it treats the ViewState as though it had been tampered with. I can't say this with absolute certainty, but ASP.NET has several security features built into the framework that may be preventing a direct POST.

If session state is involved at all, you will also need to take that into account. To simulate what the browser does, perform the following steps:

  1. Request the page.
  2. Save the collection of cookies to a variable.
  3. Extract the ViewState to a variable.
  4. Post with the appropriate form values, passing both the saved cookies and ViewState information along with the request.
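
The four steps above can be sketched roughly as follows. This is a minimal illustration, not a tested client for this particular site: it is written for Python 3 (`urllib.request`, `http.cookiejar`, `html.parser`; in Python 2 the equivalents are `urllib2`, `cookielib`, and `HTMLParser`), the names `HiddenFieldParser`, `extract_hidden_fields`, `build_post_body`, and `search` are hypothetical helpers, and it assumes the page exposes its state in ordinary hidden `<input>` fields such as `__VIEWSTATE` and `__EVENTVALIDATION`.

```python
import http.cookiejar
import urllib.parse
import urllib.request
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect the name/value pair of every <input type="hidden"> element."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attrs = dict(attrs)
        if (attrs.get("type") or "").lower() == "hidden" and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def extract_hidden_fields(html):
    """Step 3: pull __VIEWSTATE, __EVENTVALIDATION, etc. out of the page."""
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields


def build_post_body(hidden_fields, form_values):
    """Merge the freshly fetched hidden fields with the visible form values."""
    data = dict(hidden_fields)
    data.update(form_values)
    return urllib.parse.urlencode(data).encode("ascii")


def search(url, form_values):
    """Steps 1-4: GET the page, keep its cookies, then POST the form back."""
    jar = http.cookiejar.CookieJar()  # step 2: cookies are stored here
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    with opener.open(url) as page:  # step 1: request the page
        html = page.read().decode("utf-8", errors="replace")
    body = build_post_body(extract_hidden_fields(html), form_values)
    with opener.open(url, data=body) as result:  # step 4: post it back
        return result.read().decode("utf-8", errors="replace")
```

A call might look like `search(url, {'textcity': '', 'textsearch': ''})`, using the field names from the question. The point of the design is that the hidden values are read from the same response, and sent through the same cookie-aware opener, as the POST that follows, rather than being pasted in from an earlier browser session.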

It's a fair amount of work, I know, but not too difficult. Again, this may not be the sole source of your problem, but it is worth reading up on in order to start troubleshooting.

Josh
  • Thank you for that answer. I see that the page does expire in my browser and requests a login after some time, so this might be solved in part by adding a cookie to the request. Would you have any tips on how to save the collection of cookies to a variable, as mentioned in step 2? – user773328 Jun 07 '11 at 19:02
  • In .NET it is pretty easy because there is a CookieContainer associated with the HttpWebRequest object. Unfortunately my knowledge of Python is slim to none, but I was able to scrounge up this resource: http://www.voidspace.org.uk/python/articles/cookielib.shtml – Josh Jun 07 '11 at 19:36
  • Excuse me once again, but I have been successfully using urllib2 and cookielib to make multiple requests with the same cookie to a sample page such as amazon.com. I have also read your links; now I am trying to understand how to approach step 3, "Extract the ViewState to a variable", which I tried to do in the above code. – user773328 Jun 08 '11 at 14:15

I tried both mechanize and urllib2, and mechanize handles cookies better. With mechanize, I can submit the form simply by specifying:

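    import mechanize
    browser = mechanize.Browser()
    browser.open(url)  # fetch the page first so its forms can be selected
    browser.select_form(form_name)
    browser.set_value("Page$Next", name="pagenumber")
    response = browser.submit()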

It was not necessary to replicate the POST request manually, and in this case mechanize was able to handle a form that relies on JavaScript.

user773328