I'm trying to download some files from behind a SSO (Single Sign-On) site. It seems to be SAML authenticated, that's where I'm stuck. Once authenticated I'll be able to perform API requests that return JSON, so no need to interpret/scrape.
Not really sure how to deal with that in mechanicalsoup (and relatively unfamiliar with web-programming in general), help would be much appreciated.
Here's what I've got so far:
import mechanicalsoup
from getpass import getpass
import json
login_url = ...
br = mechanicalsoup.StatefulBrowser()
response = br.open(login_url)
if verbose: print(response)
# provide the username + password
br.select_form('form[id="loginForm"]')
print(br.get_current_form().print_summary()) # Just to see what's there.
br['UserName'] = input('Email: ')
br['Password'] = getpass()
response = br.submit_selected().text
if verbose: print(response)
At this point I get a page telling me javascript is disabled and that I must click submit to continue. So I do:
br.select_form()
response = br.submit_selected().text
if verbose: print(response)
That's where I get a complaint about state information being lost.
Output:
<h2>State information lost</h2>
State information lost, and no way to restart the request<h3>Suggestions for resolving this problem:</h3><ul><li>Go back to the previous page and try again.</li><li>Close the web browser, and try again.</li></ul><h3>This error may be caused by:</h3><ul><li>Using the back and forward buttons in the web browser.</li><li>Opened the web browser with tabs saved from the previous session.</li><li>Cookies may be disabled in the web browser.</li></ul>
The only hits I've found on scraping behind SAML logins are all going with a selenium approach (and sometimes dropping down to requests).
Is this possible with mechanicalsoup?