I'm trying to download some files from behind an SSO (Single Sign-On) site. It appears to use SAML authentication, and that's where I'm stuck. Once authenticated, I'll be able to make API requests that return JSON, so there's no need to interpret or scrape HTML.

I'm not really sure how to deal with that in MechanicalSoup (and I'm relatively unfamiliar with web programming in general), so help would be much appreciated.

Here's what I've got so far:

import mechanicalsoup
from getpass import getpass
import json

verbose = True  # set to False to silence the debug output

login_url = ...
br = mechanicalsoup.StatefulBrowser()
response = br.open(login_url)
if verbose: print(response)

# provide the username + password
br.select_form('form[id="loginForm"]')
br.get_current_form().print_summary()  # Just to see what's there; print_summary() prints on its own and returns None.
br['UserName'] = input('Email: ')
br['Password'] = getpass()
response = br.submit_selected().text
if verbose: print(response)

At this point I get a page telling me JavaScript is disabled and that I must click submit to continue. So I do:

br.select_form()
response = br.submit_selected().text
if verbose: print(response)

That's where I get a complaint about state information being lost.

Output:

<h2>State information lost</h2>

State information lost, and no way to restart the request<h3>Suggestions for resolving this problem:</h3><ul><li>Go back to the previous page and try again.</li><li>Close the web browser, and try again.</li></ul><h3>This error may be caused by:</h3><ul><li>Using the back and forward buttons in the web browser.</li><li>Opened the web browser with tabs saved from the previous session.</li><li>Cookies may be disabled in the web browser.</li></ul>

The only hits I've found on scraping behind SAML logins are all going with a selenium approach (and sometimes dropping down to requests).

Is this possible with mechanicalsoup?

dabell
  • If the SSO requires javascript, then MechanicalSoup may not be appropriate, because it doesn't support javascript (see https://mechanicalsoup.readthedocs.io/en/stable/faq.html#form-submission-has-no-effect-or-fails). Based on the intermediate page you're getting, it kind of sounds like the site's fallback for when javascript is disabled isn't working correctly. – Daniel Hemberger Jan 28 '20 at 22:13
  • Thanks @DanielHemberger, I was under the impression it was an auth/cookie sort of problem. Any advice on how to check the fallback? Running selenium with javascript and redirects disabled perhaps? – dabell Jan 29 '20 at 15:18
  • You can actually test it with just your web browser, no Selenium needed. (In Chrome, see https://developers.google.com/web/tools/chrome-devtools/javascript/disable.) – Daniel Hemberger Jan 29 '20 at 23:38
  • Well, that brought up a page I hadn't seen before: `"The Duo Access Gateway requires JavaScript to protect users against Cross-Site Request Forgery attacks. Please enable JavaScript in your browser to proceed."` I know MechanicalSoup does not do JavaScript, so this appears to be a dead end. I'd hoped I could avoid Selenium's overhead. For those looking at the same thing in the future: I ended up logging in with chromedriver in Selenium, then dropping to requests once the authentication was complete, following the approach found [here](https://stackoverflow.com/a/54087929/5874274). – dabell Jan 31 '20 at 20:14

1 Answer

My situation turned out to require JavaScript for login. My original question about getting through SAML auth did not describe the real environment, so the question as asked has not truly been answered.
Thanks to @Daniel Hemberger for helping me figure that out in the comments.

In this situation MechanicalSoup is not the correct tool (because of the JavaScript requirement), and I ended up using Selenium to get through authentication and then switching to requests.
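
A minimal sketch of that handoff, in case it helps someone. This is not my exact code: the function names `copy_cookies` and `login_and_get_session`, the `is_logged_in` callback, and the timeout are made up for illustration. The idea is to let a real browser complete the JavaScript login, then copy its cookies (in the list-of-dicts form returned by Selenium's `driver.get_cookies()`) into a `requests.Session`. It assumes Selenium, chromedriver, and requests are installed; the two third-party packages are imported inside the login function so the cookie helper can be tried without a browser.

```python
def copy_cookies(selenium_cookies, session):
    """Copy cookies from Selenium's list-of-dicts format into a session.

    `session` is expected to be a requests.Session (anything exposing
    `.cookies.set(name, value, domain=..., path=...)` works).
    """
    for cookie in selenium_cookies:
        session.cookies.set(
            cookie["name"],
            cookie["value"],
            domain=cookie.get("domain", ""),
            path=cookie.get("path", "/"),
        )
    return session


def login_and_get_session(login_url, is_logged_in, timeout=300):
    """Log in through a real browser, then return a requests.Session.

    `is_logged_in` is a hypothetical callback taking the driver and
    returning True once authentication has finished; what it checks
    depends entirely on the site.
    """
    import requests
    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(login_url)
        # Complete the login (and any second factor) in the browser window.
        WebDriverWait(driver, timeout).until(is_logged_in)
        return copy_cookies(driver.get_cookies(), requests.Session())
    finally:
        driver.quit()


# Example use (the URL check is an assumption about the post-login page):
# session = login_and_get_session(login_url, lambda d: "login" not in d.current_url)
# data = session.get(api_url).json()
```

Whether the copied cookies are enough depends on the site; some APIs also expect particular headers or tokens, which would have to be carried over separately.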

dabell