I'm trying to scrape an ASP.NET page where I need to page through a list of items in a GridView control. I've never used ASP.NET, and after searching the Net for pointers I've hit a brick wall. The page links are of the form:

javascript:__doPostBack('ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ContentPlaceHolderFullWidthContent$ContentPlaceHolderMain$gridViewItems','Page$2')

I'm currently trying to get this working using Mechanize in Python. I initially tried the following, assuming that the VIEWSTATE variables would be handled by mechanize.

br.form.set_all_readonly(False)
br['__EVENTTARGET'] = 'ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ContentPlaceHolderFullWidthContent$ContentPlaceHolderMain$gridViewItems'
br['__EVENTARGUMENT'] = 'Page$2'
response = br.submit(name="ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ContentPlaceHolderFullWidthContent$ContentPlaceHolderMain$itemLocator$btnItemSearch")
html = br.response().read()

Using a network monitor (Fiddler2), I noticed that some extra variables were populated in the browser's request, so I added these in too:

br.select_form(nr=0)
br.form.new_control('hidden','ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ScriptManager1',attrs = dict(name='ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ScriptManager1'))
br.form.new_control('hidden','hiddenInputToUpdateATBuffer_CommonToolkitScripts',attrs = dict(name='hiddenInputToUpdateATBuffer_CommonToolkitScripts'))
br.form.new_control('hidden','__ASYNCPOST',attrs = dict(name='__ASYNCPOST'))
br.form.set_all_readonly(False)
br['hiddenInputToUpdateATBuffer_CommonToolkitScripts'] = '1'
br['__ASYNCPOST'] = 'TRUE'
br['ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ScriptManager1'] = 'ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ContentPlaceHolderFullWidthContent$ContentPlaceHolderMain$SearchResultsUpdatePanel|ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ContentPlaceHolderFullWidthContent$ContentPlaceHolderMain$gridViewItems'
br['__EVENTTARGET'] = 'ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ContentPlaceHolderFullWidthContent$ContentPlaceHolderMain$gridViewItems'
br['__EVENTARGUMENT'] = 'Page$2'
response = br.submit(name="ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ContentPlaceHolderFullWidthContent$ContentPlaceHolderMain$itemLocator$btnItemSearch")
html = br.response().read()

With both of these, the HTML I get back is still for page 1 only.

I think there may be a couple of potential issues:

  1. I'm not sure I'm doing the submit right. There are multiple submit buttons on the page so the one I'm searching for is the "search" button, which is what I previously used to get to the first page. I could see that being why the first page is displayed. If I use br.submit() without a name then it uses another submit control that takes you somewhere else.

  2. When you click a page number in a browser, the GridView control updates without a page reload. As I'm not running JavaScript, maybe I can't reproduce that, but I would at least expect to be able to get back the data from the POST and parse it.

Any help would be much appreciated!

alan
  • can you use something like Selenium/WebDriver (python bindings) instead of Mechanize? ... that lets the browser do all this correlation work for you. – Corey Goldberg Jun 13 '11 at 15:34
  • Was thinking of trying webkit in pyQt but would prefer to solve it with Mechanize. This is the only javascript bit needed and it seems so simple there's got to be a way round it! – alan Jun 13 '11 at 16:25
  • before you submit the form you can do `print br.form` to get a visual representation of all the fields in the form with the values you've entered. But as Corey Goldberg says Selenium Webdriver might be less hassle depending on your needs. – cerberos Jun 14 '11 at 01:45

1 Answer

Managed to do it by building an XMLHttpRequest-style POST by hand, per the answer here:

Using Python and Mechanize to submit form data and authenticate
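
For illustration, here is a rough sketch of what building such a request by hand might look like. It uses Python 3's standard library rather than mechanize, so it is not the code from the linked answer. The field names are copied from the question; the function name `build_postback_request`, the example URL, and the two AJAX headers are assumptions about what the page's ScriptManager expects, so compare them against what Fiddler shows for your page. The key point is that no submit button name goes into the POST body, so the server should not treat the request as a fresh search.

```python
from urllib.parse import urlencode, parse_qs
from urllib.request import Request

PREFIX = "ctl00$ctl00$ctl00$ContentPlaceHolderEverything$"
MAIN = PREFIX + "ContentPlaceHolderFullWidthContent$ContentPlaceHolderMain$"
SCRIPT_MANAGER = PREFIX + "ScriptManager1"
PANEL = MAIN + "SearchResultsUpdatePanel"
GRID = MAIN + "gridViewItems"


def build_postback_request(url, hidden_fields, page):
    """Build (but do not send) a POST that imitates the GridView pager link.

    hidden_fields is a dict of the hidden inputs scraped from the current
    page (__VIEWSTATE, __EVENTVALIDATION, ...). Hypothetical helper.
    """
    fields = dict(hidden_fields)
    fields.update({
        SCRIPT_MANAGER: PANEL + "|" + GRID,
        "__EVENTTARGET": GRID,
        "__EVENTARGUMENT": "Page$%d" % page,
        "__ASYNCPOST": "true",
    })
    body = urlencode(fields).encode("utf-8")
    headers = {
        "Content-Type": "application/x-www-form-urlencoded; charset=utf-8",
        # Assumed: headers a browser's partial postback would normally carry.
        "X-MicrosoftAjax": "Delta=true",
        "X-Requested-With": "XMLHttpRequest",
    }
    return Request(url, data=body, headers=headers)


if __name__ == "__main__":
    # Placeholder URL and hidden-field values, for demonstration only.
    req = build_postback_request(
        "http://example.com/items.aspx",
        {"__VIEWSTATE": "abc", "__EVENTVALIDATION": "xyz"},
        2,
    )
    print(req.get_method())
    print(sorted(parse_qs(req.data.decode("utf-8"))))
```

Passing the returned object to `urllib.request.urlopen` would send it. If the server accepts it as a partial postback, the response is likely to be a pipe-delimited update for the panel rather than a full HTML page, so the grid markup would need to be pulled out of that before parsing.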

alan