So I'm writing a web crawler to batch download PDFs from my university's website, as I don't fancy downloading them one by one.
I've got most of the code working using the 'requests' module. The issue is that you have to be signed in to a university account to access the PDFs, so I've set up requests to use a session with cookies to sign into my university account before downloading the PDFs. However, the HTML sign-in form on the university page is rather peculiar.
I've abstracted the relevant HTML, shown below:
<form action="/login" method="post">
  <fieldset>
    <div>
      <label for="username">Username:</label>
      <input id="username" name="username" type="text" value="" />
      <label for="password">Password:</label>
      <input id="password" name="password" type="password" value="" />
      <input type="hidden" name="lt" value="" />
      <input type="hidden" name="execution" value="*very_long_encrypted_code*" />
      <input type="hidden" name="_eventId" value="submit" />
      <input type="submit" name="submit" value="Login" />
    </div>
  </fieldset>
</form>
Firstly, the action attribute of the form does not reference a PHP file, which I don't understand. Is action="/login" referencing the page itself, or http://www.blahblah/login/login? (The HTML is taken from the page http://www.blahblah/login.)
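My only lead so far: if browsers resolve the action like any other relative URL, then a quick check with urljoin suggests a leading-slash action resolves against the site root, i.e. it posts back to the same page:
from urllib.parse import urljoin

# A leading-slash action should resolve against the site root,
# regardless of the path of the page containing the form.
print(urljoin("http://www.blahblah/login", "/login"))
# prints: http://www.blahblah/login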
Secondly, what's with all the 'hidden' inputs? I'm not sure how this page takes the given login data and passes it to whatever script handles the login.
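From what I've read, a browser submits every named input in the form when you click Login, hidden ones included, so I'd guess the POST body the server expects looks something like this (values copied from the HTML above; the username and password are placeholders):
payload = {
    "username": "myuser",                       # placeholder
    "password": "mypass",                       # placeholder
    "lt": "",                                   # hidden input, empty in the HTML above
    "execution": "*very_long_encrypted_code*",  # hidden, presumably a per-page value
    "_eventId": "submit",                       # hidden, fixed value
    "submit": "Login",                          # the submit button's name/value
}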
This has led to the failure of the requests sign-on in my Python script:
import requests

loginURL = "http://www.blahblah/login"  # the page containing the form above
url = "http://www.blahblah/some.pdf"    # illustrative URL for one of the PDFs

user = input("User: ")
passw = input("Password: ")
payload = {"username": user, "password": passw}
s = requests.Session()
s.post(loginURL, data=payload)  # attempt the sign-in
r = s.get(url)                  # then fetch a protected page
I would have thought this would take the login data and sign me into the page, but r is just assigned the original logon page. I'm assuming it's to do with the strange form handling in the HTML. Any ideas what I need to change?
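In case it helps, here's the direction I was thinking of trying next: GET the login page first, pull the hidden values out of the form, and include them in the POST. A rough, untested sketch; it assumes BeautifulSoup (bs4) is installed and that the field names match the HTML above:
import requests
from bs4 import BeautifulSoup  # assuming bs4 is available

loginURL = "http://www.blahblah/login"
url = "http://www.blahblah/some.pdf"  # illustrative PDF URL

user = input("User: ")
passw = input("Password: ")

s = requests.Session()

# GET the login page first: the session picks up any cookies, and we can
# read the per-page hidden values (lt, execution, _eventId) out of the form.
page = s.get(loginURL)
soup = BeautifulSoup(page.text, "html.parser")

payload = {"username": user, "password": passw}
for hidden in soup.find_all("input", type="hidden"):
    payload[hidden["name"]] = hidden.get("value", "")

s.post(loginURL, data=payload)
r = s.get(url)  # hopefully the PDF now, not the logon page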
EDIT: Thought I'd also mention there is no JavaScript on the page at all. Purely HTML & CSS.