I am attempting to scrape data from one of my university's websites, which uses Shibboleth for authentication. However, I am having difficulty determining the best way to get through the login and reach the page I wish to scrape. I have valid credentials that I can use to log in. Does anyone have any suggestions for how to accomplish this?
Maybe you should google it, and keep it to yourself – Ibu May 25 '11 at 04:07
@Ibu Why? He's not asking how to bypass the security, merely how to log in programmatically. – Matthew Scharley May 25 '11 at 04:09
5 Answers
I have been scripting Shibbolized logins with success (in my case, to monitor the health of both the Shibboleth IdP and the applications it protects).
I am using Python's urllib module and its classes to handle following the redirects and passing the cookies (for Shibboleth) and posting the login form. After a little bit of tinkering, urllib gets you most of the way to a successful Shibbolized login. You could use this approach to handle the initial login to the Shibbolized website and then do the scraping with straightforward use of Python's urllib.
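A minimal sketch of the approach described above, using Python 3's `urllib.request` with a cookie jar. It is not the script referred to in this answer. The form field names `j_username` and `j_password`, and the helper names `parse_form` and `shibboleth_login`, are assumptions for illustration; check the actual login form of your IdP, which may need extra fields (for example a submit button value) or use a different flow altogether.

```python
import urllib.parse
import urllib.request
from html.parser import HTMLParser
from http.cookiejar import CookieJar


class FormParser(HTMLParser):
    """Collect the action URL and named <input> fields of the first form."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action")
        elif tag == "input" and self._in_form and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def parse_form(html):
    """Return (action, fields) for the first form in an HTML string."""
    parser = FormParser()
    parser.feed(html)
    return parser.action, parser.fields


def shibboleth_login(protected_url, username, password):
    """Log in and return an opener that carries the session cookies."""
    # The cookie jar keeps the IdP and SP session cookies across requests;
    # the default opener already follows HTTP redirects.
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar()))

    # 1. Request the protected page; the redirects end at the IdP login form.
    with opener.open(protected_url) as resp:
        login_url = resp.geturl()
        action, fields = parse_form(resp.read().decode("utf-8", "replace"))

    # 2. Post the credentials (field names are an assumption, see above).
    fields.update({"j_username": username, "j_password": password})
    data = urllib.parse.urlencode(fields).encode()
    target = urllib.parse.urljoin(login_url, action or "")
    with opener.open(target, data) as resp:
        page_url = resp.geturl()
        action, fields = parse_form(resp.read().decode("utf-8", "replace"))

    # 3. The IdP normally answers with an auto-submitting form holding
    #    SAMLResponse and RelayState; post it back to the service provider.
    data = urllib.parse.urlencode(fields).encode()
    opener.open(urllib.parse.urljoin(page_url, action or ""), data).close()
    return opener
```

After a successful login, something like `shibboleth_login(url, user, password).open(url).read()` should return the protected page, since the opener reuses the session cookies. In a browser, step 3 is performed by JavaScript that submits the form automatically, which is why it has to be done by hand here.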
You can also try Apache JMeter: record your actions, add some scripting (which is not so easy where Shibboleth is concerned), and you can then access these pages automatically.
[Edit - better solution] I believe the Shibboleth documentation pages include scripts for Grinder (another load testing tool). These test plans were in fact Python (OK, Jython) scripts, which should be quite easy to modify and use for your purposes.

Very late reply, but you could use Facebook Webdriver to do a login and scrape after you're authenticated.

You could use Mechanize to submit forms and log in to the website: http://wwwsearch.sourceforge.net/mechanize/
