
I'm not sure how to describe the problem properly, but I want to use mechanize to grab a form and get the names of its inputs. However, when I parse the page with mechanize, it shows neither the form name nor the input names. If I inspect the element manually in the browser I can see the input name, but it is dynamic: every time I inspect the element, the name is different. Any ideas? The website I'm trying to parse is https://www.ursa.ucla.edu/logon/logon.asp, if anyone is interested.

Here's what I've tried:

  import mechanize

  br = mechanize.Browser(factory=mechanize.RobustFactory())
  br.open("https://www.ursa.ucla.edu/logon/logon.asp")
  br.select_form(nr=0)
  # Dump the raw HTML that mechanize actually received
  print(br.response().read())

Thanks in Advance, Richard.

GreenMatt
ordinaryman09
  • try [beautifulSoup](http://www.crummy.com/software/BeautifulSoup/), you could try parsing the page by using its xml/html tree structure instead of tag names. – c-ram Jan 22 '12 at 06:07
  • I tried beautifulSoup too, but it didn't work either. – ordinaryman09 Jan 22 '12 at 08:55

1 Answer


The webpage you're trying to parse is not accessible directly. When you visit https://www.ursa.ucla.edu/logon/logon.asp it will do the following:

  1. Redirect you to https://shb.ais.ucla.edu/shibboleth-idp/profile/Shibboleth/SSO?shire=https%3A%2F%2Fwww.ursa.ucla.edu%2FShibboleth.sso%2FSAML%2FPOST&time=1327213354&target=cookie%3Aa872692c&providerId=https%3A%2F%2Fwww.ursa.ucla.edu%2Fshibboeth-sp (as you can see, this URL carries a couple of variables: cookie, time, etc.)
  2. The second page will redirect you to https://shb.ais.ucla.edu/shibboleth-idp/AuthnEngine
  3. The third page will redirect you to https://shb.ais.ucla.edu/shibboleth-idp/Authn/RemoteUser
  4. The last page will respond with 200 and send you markup containing a form and a couple of hidden input fields. The form submits itself on load, and only then, in the fifth response, do you get the actual login page.
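
To check how far down this chain your own client gets, something like the following might help. It is only a rough sketch using the Python 3 standard library rather than mechanize; `RedirectLogger` and `trace` are names made up for illustration, and I have not run it against the UCLA site.

```python
import http.cookiejar
import urllib.request


class RedirectLogger(urllib.request.HTTPRedirectHandler):
    """Records every redirect hop before letting urllib follow it."""

    def __init__(self):
        super().__init__()
        self.hops = []  # (status code, redirect target) in the order seen

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        self.hops.append((code, newurl))
        # Defer to the default behaviour, which builds the follow-up request.
        return super().redirect_request(req, fp, code, msg, headers, newurl)


def trace(url):
    """Open url, follow redirects, and return (hops, final_url, status)."""
    logger = RedirectLogger()
    # Assumption: the login flow sets cookies along the way, so keep them.
    cookies = urllib.request.HTTPCookieProcessor(http.cookiejar.CookieJar())
    opener = urllib.request.build_opener(logger, cookies)
    with opener.open(url) as resp:
        return logger.hops, resp.geturl(), resp.status
```

Calling `trace("https://www.ursa.ucla.edu/logon/logon.asp")` should list each hop and the URL where the client finally stopped. If that final URL is the fourth page rather than the login page, the auto-submitting form is what still needs handling.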

I don't know how Python handles redirect headers, so you may need to look at the response you are actually getting. In the best case it will be the last page with the hidden fields; you will then need to parse those and send a POST request to the same URL to get the real login page. In the worst case you will need to follow the redirect headers all the way from the first page.
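
For the best case, here is a minimal sketch of the "parse the hidden fields and POST them" step, again with only the Python 3 standard library. As far as I know neither urllib nor mechanize runs JavaScript, so the on-load submit never fires and has to be done by hand. The names `HiddenFormParser`, `build_post` and `fetch_login_page` are my own, the sample HTML is invented (the real field names will differ), and the code assumes the form uses POST and resolves its `action` against the page URL, falling back to the same URL when `action` is empty.

```python
import http.cookiejar
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class HiddenFormParser(HTMLParser):
    """Collects the action and the named <input> fields of the first <form>."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = []       # (name, value) pairs in document order
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action") or ""
        elif tag == "input" and self._in_form and attrs.get("name"):
            self.fields.append((attrs["name"], attrs.get("value") or ""))

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True  # ignore any later forms


def build_post(page_url, html):
    """Return (target_url, urlencoded_body) for the page's first form."""
    parser = HiddenFormParser()
    parser.feed(html)
    if parser.action is None:
        raise ValueError("no <form> found in the response")
    return urljoin(page_url, parser.action), urlencode(parser.fields).encode("ascii")


def fetch_login_page(start_url):
    """Follow the redirects, then submit the hidden form manually."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    with opener.open(start_url) as first:   # 3xx redirects are followed here
        page_url = first.geturl()
        html = first.read().decode("utf-8", "replace")
    target, body = build_post(page_url, html)
    with opener.open(target, data=body) as second:  # data= makes this a POST
        return second.geturl(), second.read().decode("utf-8", "replace")


# Invented example of an auto-submitting page, for trying build_post offline.
SAMPLE_HTML = """
<html><body onload="document.forms[0].submit()">
<form action="/next" method="post">
  <input type="hidden" name="target" value="cookie:abc"/>
  <input type="hidden" name="token" value="xyz"/>
  <noscript><input type="submit" value="Continue"/></noscript>
</form>
</body></html>
"""
```

`build_post("https://example.org/start", SAMPLE_HTML)` can be used to try the parsing offline before pointing `fetch_login_page` at the real site. Since the input names change on every request, read them from the response each time instead of hard-coding them.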

valentinas