4

I want to scrape this web page using Mechanize. The form element looks like this:

    <form name="ctl00" method="post" action="PSearchResults.aspx?state=ME&amp;rp=" id="ctl00">
    <div>
    <input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
    <input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />
    <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="verylongstring" /> </div>
    <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="/wEWAgKb7POZAwK4v7ffCOmari00yJft/iuZBMdOH/zh9TDI" />
    </div>
    </form>

I'm using Mechanize to print out the controls, but it can only see two of them. If I run this:

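    import mechanize

    br = mechanize.Browser()
    br.open('http://recognition.ncqa.org/PSearchResults.aspx?state=ME&rp=')
    br.select_form(name='ctl00')
    br.form.set_all_readonly(False)  # allow changing the .value of all controls
    for control in br.form.controls:
        if not control.name:
            # unnamed controls have no value to look up, so print only the type
            print(" - (type) =", control.type)
            continue
        print(" - (name, type, value) =", (control.name, control.type, br[control.name]))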

all that gets printed is this:

    - (name, type, value) = ('__VIEWSTATE', 'hidden', '/wEPDwUGNDQ5NTMwD2QWAgIBD2QWAgIHD2QWCgIBDw8WAh4E...more
    - (name, type, value) = ('__EVENTVALIDATION', 'hidden', '/wEWAgKb7POZAwK4v7ffCOmari00yJft/iuZBMdOH/zh9TDI')

Why can't Mechanize 'see' the __EVENTTARGET and __EVENTARGUMENT fields?

AP257

2 Answers

6

The site checks the User-Agent header and serves a different page to mechanize, which is why those two fields are missing from the form it sees.

Specifying this as the User-Agent seems to work:

    Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6

Here is a link showing how to set the User-Agent with mechanize
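One way to check this kind of User-Agent sniffing yourself is to fetch the page twice and compare which input fields each response contains. The following is only a rough sketch using the Python 3 standard library rather than mechanize; the names InputNameCollector, input_names, fetch and compare are made up for illustration, and I have not verified what the site returns today.

```python
import urllib.request
from html.parser import HTMLParser

# The browser User-Agent string quoted in this answer.
BROWSER_UA = ("Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) "
              "Gecko/20070725 Firefox/2.0.0.6")


class InputNameCollector(HTMLParser):
    """Hypothetical helper: collects the name of every <input> tag."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <input ... /> also end up here.
        if tag == "input":
            name = dict(attrs).get("name")
            if name:
                self.names.append(name)


def input_names(html):
    """Return the names of all input fields found in an HTML string."""
    parser = InputNameCollector()
    parser.feed(html)
    return parser.names


def fetch(url, user_agent=None):
    """Download a page, optionally sending a custom User-Agent header."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


def compare(url):
    """Return (fields with the default UA, fields with the browser UA)."""
    return input_names(fetch(url)), input_names(fetch(url, BROWSER_UA))
```

Calling compare() on the page URL should, if the server really does sniff the User-Agent, return two different lists of field names.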

John La Rooy
  • I disagree. Surely they are there in the source? There is a JavaScript 'doPostBack' function that sets the value of those fields, true - but they already existed: http://recognition.ncqa.org/PSearchResults.aspx?state=ME&rp= – AP257 Jul 27 '10 at 10:56
  • @AP257, download the page with wget, curl, or urllib2 and you will see that the source is quite different to what is displayed as the source in your browser – John La Rooy Jul 27 '10 at 11:36
  • @AP257, I revised my answer, was not javascript at all, but useragent testing :( – John La Rooy Jul 27 '10 at 12:30
  • Oh I see. I would never have worked that out by myself. Thank you so much! – AP257 Jul 27 '10 at 13:19
5

As a follow-up: I had the same problem using mechanize (Python), and I tried setting the User-Agent to

    br.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 5.2; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.47 Safari/536.11')]

as recommended by http://stockrt.github.com/p/emulating-a-browser-in-python-with-mechanize/

However, this did not work, so I opted to add the missing form elements myself using the following code:

    br.select_form(name='form')
    br.form.set_all_readonly(False)  # allow changing the .value of all controls
    br.form.new_control('text', '__EVENTARGUMENT', {'value': ''})
    br.form.new_control('text', '__EVENTTARGET', {'value': ''})
    br.form.fixup()
    br["__EVENTTARGET"] = 'lbSearch'
    br["__EVENTARGUMENT"] = ''
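For context, the reason these two fields matter is that the page's JavaScript postback function fills them in and then submits the form, so a scripted postback has to send them alongside the other hidden fields. Below is a small standard-library sketch (not mechanize) of what the resulting POST body looks like; build_postback_body is a hypothetical helper name, and the hidden field values are placeholders, not real ones from the site.

```python
from urllib.parse import urlencode


def build_postback_body(hidden_fields, event_target, event_argument=""):
    """Hypothetical helper: build the URL-encoded body of a postback.

    hidden_fields holds the hidden inputs already present on the page
    (for example __VIEWSTATE and __EVENTVALIDATION); the two event
    fields are added on top, mirroring what the page's script would do.
    """
    fields = dict(hidden_fields)  # copy so the caller's dict is untouched
    fields["__EVENTTARGET"] = event_target
    fields["__EVENTARGUMENT"] = event_argument
    return urlencode(fields)


# Placeholder values for illustration only.
body = build_postback_body(
    {"__VIEWSTATE": "verylongstring", "__EVENTVALIDATION": "placeholder"},
    "lbSearch",
)
print(body)
```

With mechanize, the new_control calls above achieve the same thing: once the controls exist on the form, their values are included when the form is submitted.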
TheChrisONeil