
So I'm writing a web crawler to batch download PDFs from my university's website, as I don't fancy downloading them one by one.

I've got most of the code working, using the 'requests' module. The issue is that you have to be signed in to a university account to access the PDFs, so I've set up requests to use cookies to sign in to my account before downloading them. However, the HTML sign-in form on the university page is rather peculiar.

I've abstracted the HTML, which is shown here:

<form action="/login" method="post">
    <fieldset>
        <div>
            <label for="username">Username:</label>                          
            <input id="username" name="username" type="text" value="" />

            <label for="password">Password:</label>
            <input id="password" name="password" type="password" value=""/>

            <input type="hidden" name="lt" value="" />
            <input type="hidden" name="execution" value="*very_long_encrypted_code*" />
            <input type="hidden" name="_eventId" value="submit" />
            <input type="submit" name="submit" value="Login" />
        </div>
    </fieldset>
</form>

Firstly, the action attribute in the form does not reference a PHP file, which I don't understand. Is action="/login" referencing the page itself, or http://www.blahblah/login/login? (The HTML is taken from the page http://www.blahblah/login.)

Secondly, what's with all the 'hidden' inputs? I'm not sure how this page is taking the given login data and passing it to a PHP script.

This has led to the failure of the requests sign on in my python script:

import requests

# loginURL (the login page) and url (a protected PDF page) are defined elsewhere
user = input("User: ")
passw = input("Password: ")
payload = {"username": user, "password": passw}
s = requests.Session()
s.post(loginURL, data=payload)
r = s.get(url)

I would have thought this would take the login data and sign me in, but r just contains the original login page. I'm assuming it has to do with the strange PHP interaction in the HTML. Any ideas what I need to change?

EDIT: Thought I'd also mention there is no JavaScript on the page at all. Purely HTML & CSS.

Brand0n
  • Yes, the action points to the page itself. There could be a rewrite on the server that sends it to a PHP script. – Barmar Apr 05 '18 at 17:41
  • Do you know how I would submit a PHP form from the requests module then? – Brand0n Apr 05 '18 at 17:41
  • Hidden inputs are used to pass along data from the script that creates the page to the script that processes the form. Just copy the values from those fields into your request payload. – Barmar Apr 05 '18 at 17:42
  • Use Beautiful Soup to get the hidden input values. – Barmar Apr 05 '18 at 17:42
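The first comment's point about the action can be checked with the standard library. This is only a small sketch: it uses the placeholder host from the question and `urllib.parse.urljoin`, which resolves a relative reference against a base URL in the way a browser would resolve a form's action against the page it came from.

```python
from urllib.parse import urljoin

# Page the form was taken from (placeholder host from the question).
page_url = "http://www.blahblah/login"

# A leading slash makes the action root-relative: it replaces the page's
# path instead of being appended to it.
login_url = urljoin(page_url, "/login")
print(login_url)  # http://www.blahblah/login

# For contrast: only an action without the leading slash, on a page whose
# URL ends in a slash, would resolve to .../login/login.
nested_url = urljoin("http://www.blahblah/login/", "login")
print(nested_url)  # http://www.blahblah/login/login
```

So for this form the POST target is the login page's own URL, which is what `loginURL` in the script should be set to.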

1 Answer


What you are looking at is likely a CSRF token.

The linked answer is very good, but in summary, these tokens are used to make sure that malicious requests can't be sent to a site from another page in your web browser. In this case it is a bit silly, because logging in has no consequences. It was likely added automatically by the framework your university website uses.

You will have to extract this token from the login page before doing your login POST and then include it with your data.

The full steps would be the following:

  1. Fetch the login page
  2. Extract the token with e.g. BeautifulSoup or requests-html
  3. Send the login request:

    payload = {"username" : user, "password" : passw, "execution": token}
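The three steps above could be sketched roughly as follows. To stay dependency-free, this sketch swaps BeautifulSoup for the standard library's `html.parser`. The names `HiddenInputParser`, `hidden_fields` and `login` are made up for illustration, and `session` is assumed to be a `requests.Session` (anything with compatible `get` and `post` methods would do). It copies every hidden input into the payload rather than only `execution`, since the follow-up comment reports that all of them were needed.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collects the name/value pair of every <input type="hidden">."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <input ... />.
        attrs = dict(attrs)
        is_hidden = (attrs.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def hidden_fields(html):
    """Return a dict of all hidden form inputs found in the HTML."""
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.fields


def login(session, login_url, user, password):
    # Step 1: fetch the login page (the session keeps any cookies it sets).
    page = session.get(login_url)
    # Step 2: start from the hidden inputs, then add the credentials.
    payload = hidden_fields(page.text)
    payload.update({"username": user, "password": password})
    # Step 3: post everything back to the form's action URL.
    return session.post(login_url, data=payload)
```

With the script from the question, this would be used by creating `s = requests.Session()`, calling `login(s, loginURL, user, passw)`, and then fetching the PDFs with `s.get(url)` on the same session.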

Azsgy
  • I tried this, initially it didn't work but I added ALL of the hidden inputs into the payload and it worked fine! Thanks – Brand0n Apr 05 '18 at 19:55
  • great! The event_id one being necessary was pretty obvious in hindsight, I guess. – Azsgy Apr 05 '18 at 19:56