How to parse a website that doesn't show its code in view source?

Question

I'm not sure how to describe the problem properly, but I want to use mechanize to grab a form and get the names of its inputs. However, when I parse the page with mechanize, it shows neither the form name nor the input names. If I look at the website manually, I have to inspect the element to get the input name, and even then it's dynamic: each time I inspect the element, it gives me a different name. Any ideas? By the way, the website I'm trying to parse is https://www.ursa.ucla.edu/logon/logon.asp, if anyone is interested.

Here's what I've tried:

  import mechanize

  # Open the logon page, select the first form, and dump the raw response
  br = mechanize.Browser(factory=mechanize.RobustFactory())
  br.open("https://www.ursa.ucla.edu/logon/logon.asp/")
  br.select_form(nr=0)
  print br.response().read()

Thanks in Advance, Richard.

try [beautifulSoup](http://www.crummy.com/software/BeautifulSoup/), you could try parsing the page by using its xml/html tree structure instead of tag names. — c-ram, Jan 22 '12 at 06:07

score 1 · Accepted Answer · answered Jan 22 '12 at 06:36

The webpage you're trying to parse is not directly accessible. When you visit https://www.ursa.ucla.edu/logon/logon.asp, the following happens:

1. It redirects you to https://shb.ais.ucla.edu/shibboleth-idp/profile/Shibboleth/SSO?shire=https%3A%2F%2Fwww.ursa.ucla.edu%2FShibboleth.sso%2FSAML%2FPOST&time=1327213354&target=cookie%3Aa872692c&providerId=https%3A%2F%2Fwww.ursa.ucla.edu%2Fshibboeth-sp (as you can see, this URL carries a couple of variables: cookie, time, and so on).
2. The second page redirects you to https://shb.ais.ucla.edu/shibboleth-idp/AuthnEngine
3. The third page redirects you to https://shb.ais.ucla.edu/shibboleth-idp/Authn/RemoteUser
4. The last page responds with 200 and sends markup containing a form and a couple of hidden input fields. The form submits itself on load, and only then, on the fifth response, do you get the actual login page.

I don't know how Python handles redirect headers, so you may need to look at the response you are actually getting. In the best-case scenario it will be the last page with the hidden variables; you will need to parse those and send a POST request to the same URL to get the real login page. In the worst-case scenario you will need to follow the headers yourself, all the way from the first page.
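
For the best-case scenario, here is a minimal sketch of pulling the form target and hidden fields out of such an auto-submitting page. It assumes Python 3 and only the standard library (not mechanize), and the names `AutoSubmitFormParser` and `build_form_post` are made up for illustration. It only looks at the first form on the page and at `<input>` elements that carry a `name`.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class AutoSubmitFormParser(HTMLParser):
    """Collect the first form's action, method and named input values."""

    def __init__(self):
        super().__init__()
        self.action = None      # stays None if the page has no <form>
        self.method = "get"
        self.fields = {}        # input name -> value, in document order
        self._in_form = False
        self._done = False      # True once the first form has been closed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done and not self._in_form:
            self._in_form = True
            self.action = attrs.get("action") or ""
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input" and self._in_form and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def build_form_post(page_url, html):
    """Return (target_url, urlencoded_body) for the first form in html."""
    parser = AutoSubmitFormParser()
    parser.feed(html)
    if parser.action is None:
        raise ValueError("no <form> found in the page")
    # An empty action resolves to page_url itself, i.e. "the same URL".
    target = urljoin(page_url, parser.action)
    body = urlencode(parser.fields).encode("ascii")
    return target, body
```

The returned pair could then be sent as a POST, for example with `urllib.request.Request(target, data=body)`, using an opener that keeps cookies between requests. Whether the real page needs extra headers or fields is something to check against the actual response.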

can you explain a little more how to follow the headers from the first page? TIA. — ordinaryman09, Jan 22 '12 at 08:59
The following touches on an approach to catching redirects using urllib2: http://stackoverflow.com/a/8794765/1104941 — sgallen, Jan 22 '12 at 14:13
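
As a rough illustration of following the headers from the first page, the sketch below records every redirect hop while still letting the standard library follow it. It assumes Python 3's `urllib.request` (the successor to urllib2) and is not taken from the linked answer; `RedirectLogger` and `trace_redirects` are hypothetical names.

```python
import urllib.request


class RedirectLogger(urllib.request.HTTPRedirectHandler):
    """Record every redirect hop before following it."""

    def __init__(self):
        self.hops = []  # list of (status_code, target_url)

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        self.hops.append((code, newurl))
        # Defer to the default behaviour so the redirect is still followed.
        return super().redirect_request(req, fp, code, msg, headers, newurl)


def trace_redirects(url):
    """Open url and return (hops, final_url, body)."""
    logger = RedirectLogger()
    # Keep cookies across hops; single sign-on flows usually depend on them.
    opener = urllib.request.build_opener(
        logger, urllib.request.HTTPCookieProcessor()
    )
    with opener.open(url) as response:
        return logger.hops, response.geturl(), response.read()
```

Calling `trace_redirects("https://www.ursa.ucla.edu/logon/logon.asp")` would return the list of hops together with the final page body, which is the markup to inspect for the self-submitting form. Note that the form's on-load submission is done by JavaScript, so it will not show up as a redirect here.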
