
I am scraping some data for a company project, but all of it is behind the 2-factor authentication my company has in place. The 2-factor authentication requires me to enter a code from my phone/hardware token that lasts 6 seconds. It cannot be disabled for a variety of reasons.

Is there any way I can scrape this information? If I run it right now, BS just returns the login page (where I have to enter username/pwd before being taken to the 2 factor page).

If needed, I can also enter the 2-factor code manually (although this would have to be repeated every 12 hours, so this method is not preferred). However, I have not had any success with this either: BeautifulSoup does not read from a browser that is already logged in, and the 2-factor code changes every 6 seconds or so and with every login. I need to visit multiple different pages, so this is basically as viable as saving each page as HTML manually.

  • What are you using to download the web pages, before passing them to BS? Can't you just log in using a normal browser, copy those cookies, and send them with every HTTP request? Obviously you'll want to make sure the `User-Agent` and other headers are the same. – GordonAitchJay May 22 '20 at 07:09
  • @GordonAitchJay I'm going into Firefox and saving the page as HTML. How would you figure out which cookies to pass in? There are a ton of cookies and I'm unsure how to go about assessing which one is tied to the Duo 2-factor-auth. I presume I pass in cookies [like this](https://stackoverflow.com/questions/30048236/using-cookie-with-request)? –  May 23 '20 at 03:40
  • You want to use [Firefox's Network Monitor](https://developer.mozilla.org/en-US/docs/Tools/Network_Monitor) to see what HTTP requests take place and what cookies are set. Yes, you can set cookies like that. I would personally use a [requests.Session](https://requests.readthedocs.io/en/master/user/advanced/#session-objects), and update the session's cookies at the start. I'm pretty sure that if you just copy all the cookies, you don't need to worry about finding the one specifically for the auth. – GordonAitchJay May 23 '20 at 23:40
  • @GordonAitchJay That seems to be working in that it gets past the 2 factor (I don't redirect to the login page). However, when I copy the cookies exactly and pass them in [like this](https://stackoverflow.com/questions/15778466/using-python-requests-sessions-cookies-and-post), I get a ``. I presume I have to pass in `headers`. If I pass in all the `headers` (including cookies, in the same way as above), I still get ``. However, if I resend this information in browser, I get a good `200` as a response with all the data. I'm at a loss as how to proceed. –  May 24 '20 at 04:37
  • It appears you aren't correctly replicating the http requests that Firefox makes. Update your question with the code you have tried, and a screenshot of Firefox's Network Monitor with the http requests you want to replicate. – GordonAitchJay May 24 '20 at 06:30

1 Answer


As commenters have noted, this depends on how your site sets and checks the login status. In addition to the method in the answer you linked, you should try the following options:

```python
import requests

# using a session, and the cookies argument
s = requests.Session()
r = s.get('https://someurl', cookies={'somecookie': 'somecookievalue'})

# using a session, and http headers
s = requests.Session()
r = s.get('https://someurl', headers={'somekey': 'somevalue'})
```

In both of the above cases, the values are passed as key/value pairs in a Python dictionary. Multiple cookies (or headers) can be passed as multiple key/value pairs. In some cases, auth credentials must be passed directly (a tuple like this is sent as HTTP Basic authentication), like this:

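```python
import requests

s = requests.Session()
# a (user, password) tuple is sent as HTTP Basic auth on every request
s.auth = ('user', 'pass')
r = s.get('https://someurl')
```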

Lastly, some combination of two or more of these may be required. Without your code or more info about the website, it's difficult to say more. I hope all this helps.
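As a rough sketch of one such combination: after logging in with the browser, you could copy the raw `Cookie` request header from the Network Monitor, split it into a dictionary, and load it into a session together with the browser's `User-Agent`. The helper names, cookie names, and URL here are hypothetical, and whether this is enough depends on what your site actually checks.

```python
def parse_cookie_header(raw):
    """Split a raw 'Cookie' header value ('a=1; b=2') into a dict."""
    cookies = {}
    for part in raw.split(";"):
        # Split on the first '=' only, since cookie values may contain '='.
        name, sep, value = part.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


def build_session(raw_cookie_header, user_agent):
    """Return a requests.Session that sends the copied cookies and User-Agent."""
    # Imported here so parse_cookie_header has no third-party dependency.
    import requests

    session = requests.Session()
    session.cookies.update(parse_cookie_header(raw_cookie_header))
    session.headers.update({"User-Agent": user_agent})
    return session


# Hypothetical usage (values copied from the browser after logging in):
# s = build_session("sessionid=abc123; othercookie=xyz", "Mozilla/5.0 ...")
# r = s.get("https://someurl")
# soup = BeautifulSoup(r.text, "html.parser")
```

Because the cookies and headers live on the session, every later `s.get(...)` sends them, which fits the case of visiting several pages. The copied cookies will stop working once the login expires, so they would need to be refreshed from the browser.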

Matt L.