I am trying to scrape a website protected by Shibboleth authentication. I need to log in and read its content programmatically.
I successfully logged in using the Python Mechanize package. However, the content I am looking for is loaded with JavaScript, and Mechanize doesn't handle JavaScript.
So I tried to log in using PhantomJS, which does handle JavaScript, but the website violently slammed the door in my face: "In order to access the resource, you must authenticate yourself".
I realize that I need both tools to achieve my task:
- Mechanize for a successful login,
- PhantomJS to hopefully get my data (?).
The only thing I would need is to pass cookies from Mechanize to PhantomJS. Is that possible?
Mechanize
#saving Mechanize's cookies
cj.save("MechanizeCookies.txt")
MechanizeCookies.txt
#LWP-Cookies-2.0
Set-Cookie3: _saml_idp=aHR0cHM6Ly9zaGliYm9sZXRoLmVuc2ljYWVXXXXXX; path="/"; domain=".xxxxxxx.fr"; path_spec; domain_dot; expires="2017-02-19 19:08:16Z"; version=0
Set-Cookie3: org.jasig.portal.PORTLET_COOKIE=OgkWalk7G5Woc3Vy_LdMdLakE8GHXXXXXXXX; path="/uPortal/"; domain="ent.xxxxxxx.fr"; path_spec; expires="2015-05-26 19:08:23Z"; version=0
PhantomJS
Here is my attempt with PhantomJS, but result.png shows the login form.
var page = require('webpage').create();
page.settings.userAgent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36';
page.open('https://ent.xxxxxxx.fr/home', function (status) {
    // Save a screenshot of whatever page was loaded
    page.render('result.png');
    // Without this call the PhantomJS process never terminates
    phantom.exit();
});
$ phantomjs --cookies-file=cookie.txt cookieloader.js
How could I load those Mechanize cookies into PhantomJS, CasperJS or any other library's script?
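One direction I have considered, as a minimal and untested-against-the-real-site sketch: read the LWP file back on the Python side and export the cookies as JSON, so that a PhantomJS script could load them itself. This assumes Python 3's standard `http.cookiejar` module (the Python 2 equivalent is `cookielib`); the function names `lwp_to_dicts` and `lwp_to_json` are made up for illustration.

```python
import json
from http.cookiejar import LWPCookieJar


def lwp_to_dicts(path):
    """Load an LWP-format cookie file and return its cookies as plain dicts."""
    jar = LWPCookieJar()
    # Keep session cookies and already-expired cookies so nothing is
    # silently dropped while loading.
    jar.load(path, ignore_discard=True, ignore_expires=True)
    return [
        {
            "name": c.name,
            "value": c.value,
            "domain": c.domain,
            "path": c.path,
            "secure": c.secure,
            "expires": c.expires,  # seconds since the epoch, or None
        }
        for c in jar
    ]


def lwp_to_json(path):
    """Serialize the cookies from an LWP file as a JSON array (hypothetical helper)."""
    return json.dumps(lwp_to_dicts(path), indent=2)
```

Calling `lwp_to_json("MechanizeCookies.txt")` would give a JSON array that a PhantomJS script could read and feed, one object at a time, to `phantom.addCookie`. I have not verified that the Shibboleth session survives this, and the `expires` value may need converting to whatever unit PhantomJS expects.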