I'm trying to scrape an asp.net page where I need to page through the items a list of items that are in a gridview control. I've never used asp.net but have been searching the Net for pointers but now I've hit a brick wall. The page links are of the form:
javascript:__doPostBack('ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ContentPlaceHolderFullWidthContent$ContentPlaceHolderMain$gridViewItems','Page$2')
I'm currently trying to get this working using Mechanize in Python. I initially tried the following, assuming that the VIEWSTATE variables would be handled by mechanize.
br.form.set_all_readonly(False)
br['__EVENTTARGET'] = 'ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ContentPlaceHolderFullWidthContent$ContentPlaceHolderMain$gridViewItems'
br['__EVENTARGUMENT'] = 'Page$2'
response = br.submit(name="ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ContentPlaceHolderFullWidthContent$ContentPlaceHolderMain$itemLocator$btnItemSearch")
html = br.response().read()
Using a network monitor(Fiddler2), I noticed that two more variables were populated so I added these in too:
br.select_form(nr=0)
br.form.new_control('hidden','ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ScriptManager1',attrs = dict(name='ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ScriptManager1'))
br.form.new_control('hidden','hiddenInputToUpdateATBuffer_CommonToolkitScripts',attrs = dict(name='hiddenInputToUpdateATBuffer_CommonToolkitScripts'))
br.form.new_control('hidden','__ASYNCPOST',attrs = dict(name='__ASYNCPOST'))
br.form.set_all_readonly(False)
br['hiddenInputToUpdateATBuffer_CommonToolkitScripts'] = '1'
br['__ASYNCPOST'] = 'TRUE'
br['ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ScriptManager1'] = 'ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ContentPlaceHolderFullWidthContent$ContentPlaceHolderMain$SearchResultsUpdatePanel|ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ContentPlaceHolderFullWidthContent$ContentPlaceHolderMain$gridViewItems'
br['__EVENTTARGET'] = 'ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ContentPlaceHolderFullWidthContent$ContentPlaceHolderMain$gridViewItems'
br['__EVENTARGUMENT'] = 'Page$2'
response = br.submit(name="ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ContentPlaceHolderFullWidthContent$ContentPlaceHolderMain$itemLocator$btnItemSearch")
html = br.response().read()
With both of these the html I get back is still for page 1 only.
I think there may be a couple of potential issues:
I'm not sure I'm doing the submit right. There are multiple submit buttons on the page so the one I'm searching for is the "search" button, which is what I previously used to get to the first page. I could see that being why the first page is displayed. If I use br.submit() without a name then it uses another submit control that takes you somewhere else.
When you click a page number in a browser, the gridview control updates without a page reload. As I'm not running Javascript, maybe I can't get that but I would at least expect to be able to get back the data from the POST and parse that.
Any help would be much appreciated!