26

I need to scrape query results from an .aspx web page.

http://legistar.council.nyc.gov/Legislation.aspx

The url is static, so how do I submit a query to this page and get the results? Assume we need to select "all years" and "all types" from the respective dropdown menus.

Somebody out there must know how to do this.

twneale

5 Answers

30

As an overview, you will need to perform four main tasks:

  • to submit request(s) to the web site,
  • to retrieve the response(s) from the site,
  • to parse these responses,
  • to have some logic to iterate over the tasks above, with the parameters associated with navigating to the "next" pages in the results list.

The HTTP request and response handling is done with methods and classes from Python's standard-library modules urllib and urllib2. The parsing of the HTML pages can be done with the standard library's HTMLParser or with third-party modules such as Beautiful Soup.

The following snippet demonstrates requesting and receiving a search at the site indicated in the question. This site is ASP-driven, and as a result we need to ensure that we send several form fields, some of them with 'horrible' values, as these are used by the ASP logic to maintain state and to authenticate the request to some extent. Indeed, the requests have to be sent with the HTTP POST method, as this is what this ASP application expects. The main difficulty is identifying the form fields and associated values which ASP expects (getting pages with Python is the easy part).

This code is functional, or more precisely, was functional, until I removed most of the VSTATE value, and possibly introduced a typo or two by adding comments.

import urllib
import urllib2

uri = 'http://legistar.council.nyc.gov/Legislation.aspx'

# the HTTP headers are useful to simulate a particular browser (some sites deny
# access to non-browsers, bots, etc.); we also need to pass the content type.
# Note: 'HTTP_USER_AGENT' and 'HTTP_ACCEPT' are the CGI-style variable names;
# the corresponding HTTP header names are 'User-Agent' and 'Accept'.
headers = {
    'HTTP_USER_AGENT': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.13) Gecko/2009073022 Firefox/3.0.13',
    'HTTP_ACCEPT': 'text/html,application/xhtml+xml,application/xml; q=0.9,*/*; q=0.8',
    'Content-Type': 'application/x-www-form-urlencoded'
}

# we group the form fields and their values in a list (any
# iterable, actually) of name-value tuples.  This helps
# with clarity and also makes it easy to encode them later.

formFields = (
   # the viewstate is actually 800+ characters in length! I truncated it
   # for this sample code.  It can be lifted from the first page
   # obtained from the site.  It may be OK to hardcode this value, or
   # it may have to be refreshed each time / each day, by essentially
   # running an extra page request and parse for this specific value
   # (a sketch of that extra request follows this code).
   (r'__VSTATE', r'7TzretNIlrZiKb7EOB3AQE ... ...2qd6g5xD8CGXm5EftXtNPt+H8B'),

   # following are more of these ASP form fields
   (r'__VIEWSTATE', r''),
   (r'__EVENTVALIDATION', r'/wEWDwL+raDpAgKnpt8nAs3q+pQOAs3q/pQOAs3qgpUOAs3qhpUOAoPE36ANAve684YCAoOs79EIAoOs89EIAoOs99EIAoOs39EIAoOs49EIAoOs09EIAoSs99EI6IQ74SEV9n4XbtWm1rEbB6Ic3/M='),
   (r'ctl00_RadScriptManager1_HiddenField', ''), 
   (r'ctl00_tabTop_ClientState', ''), 
   (r'ctl00_ContentPlaceHolder1_menuMain_ClientState', ''),
   (r'ctl00_ContentPlaceHolder1_gridMain_ClientState', ''),

   #but then we come to fields of interest: the search
   #criteria the collections to search from etc.
                                                       # Check boxes  
   (r'ctl00$ContentPlaceHolder1$chkOptions$0', 'on'),  # file number
   (r'ctl00$ContentPlaceHolder1$chkOptions$1', 'on'),  # Legislative text
   (r'ctl00$ContentPlaceHolder1$chkOptions$2', 'on'),  # attachment
                                                       # etc. (not all listed)
   (r'ctl00$ContentPlaceHolder1$txtSearch', 'york'),   # Search text
   (r'ctl00$ContentPlaceHolder1$lstYears', 'All Years'),  # Years to include
   (r'ctl00$ContentPlaceHolder1$lstTypeBasic', 'All Types'),  #types to include
   (r'ctl00$ContentPlaceHolder1$btnSearch', 'Search Legislation')  # Search button itself
)

# these have to be encoded    
encodedFields = urllib.urlencode(formFields)

req = urllib2.Request(uri, encodedFields, headers)
f = urllib2.urlopen(req)    # that's the actual call to the HTTP site.

# *** here would normally be the in-memory parsing of f's
#     contents, but instead I store this to a file;
#     this is useful during design, allowing you to keep a
#     sample of what is to be parsed in a text editor, for analysis.

try:
  fout = open('tmp.htm', 'w')
  fout.writelines(f.readlines())
  fout.close()
except IOError:
  print('Could not open output file\n')
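
As noted in the comments inside formFields, the __VSTATE and __EVENTVALIDATION values can be refreshed by first fetching the page with a plain GET and lifting the hidden fields out of it. Below is a minimal sketch of that extra request and parse; the regular expression assumes the usual ASP.NET rendering of hidden inputs (name attribute before value), so adjust it to whatever the page source actually shows.

    import re
    import urllib2

    def hidden_field(html, name):
        # assumes the usual rendering: <input ... name="NAME" ... value="VALUE" ...>
        m = re.search(r'name="%s"[^>]*value="([^"]*)"' % re.escape(name), html)
        return m.group(1) if m else ''

    landing = urllib2.urlopen('http://legistar.council.nyc.gov/Legislation.aspx').read()
    vstate = hidden_field(landing, '__VSTATE')
    event_validation = hidden_field(landing, '__EVENTVALIDATION')
    # these freshly obtained values would then replace the hardcoded ones in formFields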

That's about it for getting the initial page. As said above, one then needs to parse the page, i.e. find the parts of interest, gather them as appropriate, and store them to file/database/wherever. This job can be done in very many ways: using HTML parsers, or XSLT-type technologies (after parsing the HTML to XML), or even, for crude jobs, simple regular expressions. Also, one of the items one typically extracts is the "next" info, i.e. a link of sorts that can be used in a new request to the server to get subsequent pages.
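
For illustration, a minimal parsing sketch along those lines might look like the following (Beautiful Soup 3, to match the Python 2 code above; with the newer bs4 package the import would be from bs4 import BeautifulSoup). The grid id and the pager-link pattern are only guesses derived from the field names in the POST data, so verify them against the saved tmp.htm before relying on them.

    from BeautifulSoup import BeautifulSoup   # 'from bs4 import BeautifulSoup' with bs4

    soup = BeautifulSoup(open('tmp.htm').read())

    # the id of the results grid is guessed from the gridMain field name above
    grid = soup.find(id='ctl00_ContentPlaceHolder1_gridMain')
    if grid is not None:
        for row in grid.findAll('tr'):
            cells = [''.join(td.findAll(text=True)).strip() for td in row.findAll('td')]
            if cells:
                print cells      # or store to file/database, as appropriate

    # ASP.NET grids often page via __doPostBack('...', 'Page$N') links; this is an
    # assumption to verify against the actual markup of the results page
    pager_links = [a for a in soup.findAll('a') if a.get('href', '').find('Page$') != -1]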

This should give you a rough flavor of what "long hand" HTML scraping is about. There are many other approaches, such as dedicated utilities, scripts for Mozilla's (Firefox) GreaseMonkey plug-in, XSLT...

mjv
  • If I am using Google Chrome, then how should I replace the value for 'HTTP_USER_AGENT'? I'm sorry if this question is dumb since I did not do much web stuff. Thanks! – taocp Sep 17 '13 at 01:14
  • @taocp, an easy way to know what `HTTP_USER_AGENT` string to use for a given browser is to visit http://www.all-nettools.com/toolbox/environmental-variables-test.php. This page will show you the header values sent by the browser; look for "HTTP_USER_AGENT". The actual string depends on the OS and specific version and build of Chrome, but should look something like `Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36` – mjv Sep 17 '13 at 04:26
  • thanks a lot for your reply. I tried your code with proper values set to my chrome browser. The result tmp.htm file says "no results found", while when I put "york" on the website itself, it returns a lot. Do you know why? – taocp Sep 19 '13 at 00:29
  • @mjv I have a similar question to this, but am still unable to follow the concepts. My thread is here: http://stackoverflow.com/questions/32638741/post-forms-using-requests-on-net-website-python. If you could help me out, I'd appreciate it a lot; this has been bugging me for a while now. – Zion Sep 17 '15 at 20:39
  • Could anyone elaborate on how to do this using Python requests module? I feel like that would be much easier... – user32882 Jan 12 '16 at 05:54
  • Worked like a charm for me. I didn't have viewstate and the other fields on my site. Apart from this, to check the header values after filling in the data the first time and submitting: inspect element, go to Network, click on the first page link, and go to Headers; you will get all the header values. I even got the form data keys. – sarvajeetsuman Jul 26 '17 at 11:04
  • The view state field keeps changing for every query. How would you know the viewstate? – Aditya Chawla Jul 27 '18 at 06:06
5

Most ASP.NET sites (the one you referenced included) will actually post their queries back to themselves using the HTTP POST verb, not the GET verb. That is why the URL is not changing as you noted.

What you will need to do is look at the generated HTML and capture all of the form values. Be sure to capture all of them, as some are used for page validation and without them your POST request will be denied.

Other than that validation, an ASPX page is, as far as scraping and posting are concerned, no different from any other web technology.
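
If you do not want to copy those values by hand, a small sketch like the one below, using the standard library's HTMLParser (the Python 2 module name), can collect the name/value of every <input> on a saved copy of the page. It ignores <select> and <textarea> elements, and very sloppy markup may need a more forgiving parser such as Beautiful Soup; the 'page.html' filename is just a placeholder.

    from HTMLParser import HTMLParser   # 'html.parser' in Python 3

    class FormFieldCollector(HTMLParser):
        # collects the name/value pairs of all <input> elements seen while parsing
        def __init__(self):
            HTMLParser.__init__(self)
            self.fields = {}

        def handle_starttag(self, tag, attrs):
            if tag == 'input':
                attrs = dict(attrs)
                if 'name' in attrs:
                    self.fields[attrs['name']] = attrs.get('value') or ''

    collector = FormFieldCollector()
    collector.feed(open('page.html').read())   # 'page.html' is a saved copy of the form page
    print collector.fields                     # includes the hidden validation fields to post back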

Jason Whitehorn
5

Selenium is a great tool to use for this kind of task. You can specify the form values that you want to enter and retrieve the HTML of the response page as a string in a couple of lines of Python code. Using Selenium you might not have to do the manual work of simulating a valid POST request and all of its hidden variables, as I found out after much trial and error.
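
As a rough sketch of that approach (assuming the selenium Python bindings and a matching browser driver are installed; the element names are taken from the form fields in the accepted answer, and the dropdowns are assumed to be plain <select> elements, which should be verified against the live page):

    from selenium import webdriver
    from selenium.webdriver.support.ui import Select

    driver = webdriver.Firefox()          # or webdriver.Chrome(), with the matching driver
    driver.get('http://legistar.council.nyc.gov/Legislation.aspx')

    # element names below are guesses based on the accepted answer's form fields
    driver.find_element_by_name('ctl00$ContentPlaceHolder1$txtSearch').send_keys('york')
    Select(driver.find_element_by_name('ctl00$ContentPlaceHolder1$lstYears')).select_by_visible_text('All Years')
    Select(driver.find_element_by_name('ctl00$ContentPlaceHolder1$lstTypeBasic')).select_by_visible_text('All Types')
    driver.find_element_by_name('ctl00$ContentPlaceHolder1$btnSearch').click()

    html = driver.page_source             # the results page, ready to be parsed or saved
    driver.quit()

If the dropdowns turn out to be script-driven widgets rather than real <select> elements, you would click them and pick the option by its visible text instead.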

user773328
  • I was successful in connecting, logging in and clicking on links using Selenium, but I am stuck at the part where you want to grab data from a page. Since the URI stays the same even after clicking, this poses a problem. – Arindam Roychowdhury Jan 12 '17 at 10:47
4

The code in the other answers was useful; I never would have been able to write my crawler without it.

One problem I did come across was cookies. The site I was crawling was using cookies for session id/security stuff, so I had to add code to get my crawler to work:

Add this import:

    import cookielib            

Init the cookie stuff:

    COOKIEFILE = 'cookies.lwp'          # the path and filename that you want to use to save your cookies in
    cj = cookielib.LWPCookieJar()       # This is a subclass of FileCookieJar that has useful load and save methods

Install CookieJar so that it is used as the default CookieProcessor in the default opener handler:

    cj.load(COOKIEFILE)
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    urllib2.install_opener(opener)

To see what cookies the site is using:

    print 'These are the cookies we have received so far :'

    for index, cookie in enumerate(cj):
        print index, '  :  ', cookie        

This saves the cookies:

    cj.save(COOKIEFILE)                     # save the cookies 
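
Putting it together with the accepted answer, a typical run would load any saved cookies before installing the opener, perform the POST, and save the cookies afterwards. The os.path.isfile check below is only there so the very first run does not fail before a cookie file exists; this is a sketch of the ordering, not additional functionality.

    import os
    import urllib2
    import cookielib

    COOKIEFILE = 'cookies.lwp'
    cj = cookielib.LWPCookieJar()
    if os.path.isfile(COOKIEFILE):                      # nothing to load on the first run
        cj.load(COOKIEFILE)
    urllib2.install_opener(urllib2.build_opener(urllib2.HTTPCookieProcessor(cj)))

    # ... perform the POST from the accepted answer here; the installed opener
    #     now sends and records the session cookies automatically ...

    cj.save(COOKIEFILE)
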
bill smith
0

"Assume we need to select "all years" and "all types" from the respective dropdown menus."

What do these options do to the URL that is ultimately submitted?

After all, it amounts to an HTTP request sent via urllib2.

To find out how to select "all years" and "all types" from the respective dropdown menus, do the following.

  1. Select '"all years" and "all types" from the respective dropdown menus'

  2. Note the URL which is actually submitted.

  3. Use this URL in urllib2.

S.Lott
  • Apparently the page is a form requiring POST, but the idea is the same: take note of the form field names and of the values associated with 'All years' and with 'all types', and use urllib2.Request to get to the data. – mjv Sep 26 '09 at 03:38
  • I'm using the Charles web debugging proxy to watch all the HTTP traffic when I surf this site and submit queries, and the URL is completely static. It contains no parameters at all. There is form data that gets passed somehow (AJAX, I guess), but I don't know how to submit that form data to the server. It all looks unintelligible to me. The fact that I can't submit a query by manipulating the URL is what's confusing me. – twneale Sep 26 '09 at 03:39
  • Once you get the results from this page, if you wish to scrape it, you may use the Python module HTMLParser or BeautifulSoup to parse the HTML page. Also, scraping will likely involve more urllib2 calls to navigate to the next pages of results. – mjv Sep 26 '09 at 03:41