What information do I need when scraping a website that requires logging in?

Question

I want to access my business's database on a website and scrape it using Python (I'm using Requests and BS4, and I can go further if needed), but I haven't managed to. Can someone point me to information and simple resources on how to scrape such sites?

I'm not talking about providing a username and password. The site requires much more than this. How do I find out what my script has to provide aside from the username and password (e.g., how do I know that I must provide, say, an auth token)?

How do I deal with a site that has no HTTP URLs, only hrefs of the form javascript:__doPostBack?

And in this regard, how do I get from the login page to the page I want (the one behind the aforementioned javascript:__doPostBack link)?

Are the libraries I'm using enough, or do you recommend using (and, in my case, learning) something else?

Your help is greatly appreciated.

score 0 · Answer 1 · answered Aug 01 '18 at 20:02

You didn't mention what you use for scraping, but it sounds like a lot of the interaction on this site is driven by client-side code. I'd suggest using a real browser to do the scraping, and interacting with the site through client-side actions (such as typing into elements or clicking buttons) rather than through low-level HTTP requests. This way, you don't need to worry about what form data to send or how to work out the URLs behind the links yourself.
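For comparison, here is a rough sketch of what the low-level route would have to reconstruct by hand. It assumes the site is an ASP.NET Web Forms application (which is where `__doPostBack` normally comes from): such a link fills the hidden fields `__EVENTTARGET` and `__EVENTARGUMENT` and submits the page's form together with its other hidden inputs (typically `__VIEWSTATE` and similar). The helper names `HiddenInputCollector` and `postback_payload` are made up for illustration, and the sketch uses only the standard library rather than BS4.

```python
import re
from html.parser import HTMLParser

# Matches href values such as: javascript:__doPostBack('ctl00$Main$lnk','')
POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)'\s*,\s*'([^']*)'\)")


class HiddenInputCollector(HTMLParser):
    """Collect the name/value pair of every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_hidden = (attrs.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def postback_payload(page_html, href):
    """Build the form data a browser would submit when the link is clicked."""
    match = POSTBACK_RE.search(href)
    if match is None:
        raise ValueError("href is not a __doPostBack link: %r" % href)

    collector = HiddenInputCollector()
    collector.feed(page_html)

    # Start from the page's hidden state, then add the two postback fields.
    payload = dict(collector.fields)
    payload["__EVENTTARGET"], payload["__EVENTARGUMENT"] = match.groups()
    return payload
```

The resulting dictionary would then be POSTed back to the same page with the logged-in session (for example a `requests.Session`). Whether that is enough depends on the site; cookies and extra headers may also matter, which is exactly the bookkeeping a real browser spares you.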

One recommended method of doing this would be to use BeautifulSoup with Selenium / WebDriver. There are multiple resources on how to do this, for example: How can I parse a website using Selenium and Beautifulsoup in python?
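A minimal sketch of that combination might look like the following. Everything site-specific here is an assumption: the field names `username` and `password`, the submit-button selector, the link text, and the idea that the target page contains a table. Inspect the real pages with the browser's developer tools to find the right locators. The function names are hypothetical too.

```python
from bs4 import BeautifulSoup


def fetch_html_after_login(login_url, username, password, link_text, timeout=15):
    """Log in through a real browser, click a postback link, return the HTML."""
    # Imported here so table_rows() below can be used without a browser set up.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()  # webdriver.Chrome() works the same way
    try:
        driver.get(login_url)
        # Hypothetical locators: adjust them to the actual login form.
        driver.find_element(By.NAME, "username").send_keys(username)
        driver.find_element(By.NAME, "password").send_keys(password)
        driver.find_element(By.CSS_SELECTOR, "input[type='submit']").click()

        # Clicking the link lets the browser run javascript:__doPostBack itself.
        wait = WebDriverWait(driver, timeout)
        link = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, link_text)))
        link.click()

        # Wait for the old page to go away, then for the new content to appear.
        # This assumes a full page reload; a partial update needs another wait.
        wait.until(EC.staleness_of(link))
        wait.until(EC.presence_of_element_located((By.TAG_NAME, "table")))
        return driver.page_source
    finally:
        driver.quit()


def table_rows(html):
    """Return every table row in the HTML as a list of cell texts."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        for row in soup.find_all("tr")
    ]
```

Usage would be along the lines of `rows = table_rows(fetch_html_after_login(url, user, pw, "Reports"))`, where `"Reports"` stands in for whatever text the `__doPostBack` link shows on the page.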

I suggest you read up on Selenium, try it, and find out for yourself. — shevron, Aug 02 '18 at 06:31
