I'm attempting to log in to and scrape a website which obfuscates the URL.
The first step is to log in on the site's login page, which I've successfully done using Python's requests package.
The second step is to enter some data into the form below and "click" Search, as shown in the picture below. This is where I'm having difficulties.
If I do this in my browser, the page takes about 30 seconds to load: a fancy loading graphic is displayed first, then the full page with text appears (this is what I wish to scrape).
When I inspect the source of the form input box from inside my browser, I see this:
<form id="companysearch_form" method="post" style="display: inline;">
  <input name="csrfmiddlewaretoken" type="hidden" value="bTo9V2w4tcHsgKkIucS9c0Nkzv5rhueqEjOdQLexg3pWqApZ9Ht4xYboj6y9TIwy"/>
  <div class="form-group" id="div_id_company_symbols">
    <div class="">
      <input class="textinput textInput form-control" id="id_company_symbols" maxlength="100" name="company_symbols" placeholder="Symbols..." required="" type="text"/>
    </div>
  </div>
  <input id="id_honeypot" name="honeypot" type="hidden"/>
  <button class="btn btn-outline-success btn-sm" type="submit">Search</button>
</form>
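Besides the visible text box, the form carries two hidden fields (csrfmiddlewaretoken and honeypot), so a POST that mimics the browser would presumably have to send all three. As a rough sketch, assuming the page HTML is available as a string, the field names and their default values could be collected with the standard library's html.parser. FormFieldParser, extract_form_fields and SAMPLE_HTML are hypothetical names used only for illustration; they are not part of the site or of requests.

```python
from html.parser import HTMLParser

# Trimmed copy of the form shown above, used only as sample input.
SAMPLE_HTML = """
<form id="companysearch_form" method="post" style="display: inline;">
  <input name="csrfmiddlewaretoken" type="hidden" value="bTo9V2w4tcHsgKkIucS9c0Nkzv5rhueqEjOdQLexg3pWqApZ9Ht4xYboj6y9TIwy"/>
  <input id="id_company_symbols" name="company_symbols" required="" type="text"/>
  <input id="id_honeypot" name="honeypot" type="hidden"/>
  <button type="submit">Search</button>
</form>
"""


class FormFieldParser(HTMLParser):
    """Collect name/value pairs of <input> elements inside one form."""

    def __init__(self, form_id):
        super().__init__()
        self.form_id = form_id
        self.in_form = False
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and attrs.get("id") == self.form_id:
            self.in_form = True
        elif tag == "input" and self.in_form and "name" in attrs:
            # Inputs without a value attribute are submitted as empty strings.
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


def extract_form_fields(html, form_id):
    """Return a dict of input names to default values for the given form id."""
    parser = FormFieldParser(form_id)
    parser.feed(html)
    return parser.fields


fields = extract_form_fields(SAMPLE_HTML, "companysearch_form")
fields["company_symbols"] = "NVDA"  # fill in the visible text box
print(sorted(fields))
```

Building the payload from the page itself, rather than typing the keys by hand, would at least make it obvious when a field such as honeypot is missing from the request.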
This is my attempt:
import sys
import requests
login_URL = '' # this is the landing page URL where I have to login
companysearch_URL = '' # this is where I get directed to automatically after I login
client = requests.session()
def get_csrftoken(client):
    if 'csrftoken' in client.cookies:
        csrftoken = client.cookies['csrftoken']
    else:
        csrftoken = client.cookies['csrf']
    return csrftoken
## STEP 1 - this works brilliantly ##
client.get(login_URL)
token = get_csrftoken(client)
payload = {
    'csrfmiddlewaretoken': token,
    'login': 'username',
    'password': 'password'
}
s = client.post(login_URL, data=payload)
## STEP 2 ATTEMPT - this doesn't work. ##
payload = {
    'csrfmiddlewaretoken': token,
    'company_symbols': 'NVDA'
}
s = client.post(companysearch_URL, data=payload)
When I run step 2, s = client.post(companysearch_URL, data=payload) returns in about one second, so I immediately know it didn't work. I then read the text of s and notice that there is no data at all.
Headers from s:
{'Content-Type': 'text/html; charset=utf-8', 'Date': 'Thu, 17 Dec 2020 20:27:48 GMT', 'Server': 'Apache/2.4.46 (Amazon) mod_wsgi/3.5 Python/3.6.12', 'Set-Cookie': 'csrftoken=K01sxNa1M1st9BlpwAdlyEu1VIudSek9PCGhZViozNiqbROnC62YuzHzExoRVAZM; expires=Thu, 16 Dec 2021 20:27:48 GMT; Max-Age=31449600; Path=/; SameSite=Lax; Secure', 'Vary': 'Cookie,X-Forwarded-Proto', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'DENY', 'Content-Length': '12298', 'Connection': 'keep-alive'}
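The response sets a csrftoken cookie, so one thing that can be checked is which token the server issued with this response and whether it still matches the token captured before logging in. The sketch below only parses the Set-Cookie value copied from the headers above using the standard library's http.cookies; csrf_cookie_from_header and SET_COOKIE are hypothetical names, and whether a token mismatch actually explains the failed search is an assumption, not something these headers prove.

```python
from http.cookies import SimpleCookie

# Set-Cookie value copied from the response headers shown above.
SET_COOKIE = (
    "csrftoken=K01sxNa1M1st9BlpwAdlyEu1VIudSek9PCGhZViozNiqbROnC62YuzHzExoRVAZM; "
    "expires=Thu, 16 Dec 2021 20:27:48 GMT; Max-Age=31449600; Path=/; "
    "SameSite=Lax; Secure"
)


def csrf_cookie_from_header(set_cookie_header, name="csrftoken"):
    """Return the value of the named cookie in a Set-Cookie header, or None."""
    jar = SimpleCookie()
    jar.load(set_cookie_header)  # parses the cookie and its attributes
    morsel = jar.get(name)
    return morsel.value if morsel is not None else None


issued = csrf_cookie_from_header(SET_COOKIE)
print(issued)
```

In a live session the same value should also be readable as client.cookies['csrftoken'] after the request, so it can be compared directly with the token variable that was read before the login POST.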