I am scraping LAN data using BeautifulSoup4 and Python requests for a company project. Since the site has a login interface, I am not authorized to access the data. The login interface is a pop-up that doesn't allow me to access the page source or inspect the page elements without log in. the error I get is this-
Access Error: Unauthorized Access to this document requires a User ID
This is a screen-shot of the pop-up box (The blackened part is sensitive information). It has not information about the html tags at all, hence I cannot auto-login via python.
I have tried requests_ntlm, selenium, python requests and even ParseHub but it did not work. I have been stuck in this phase for a month now! Please, any help would be appreciated.
Below is my initial code:
import requests
from requests_ntlm import HttpNtlmAuth
from bs4 import BeautifulSoup
r = requests.get("www.amazon.in")
from urllib.request import Request, urlopen
req = Request('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
print r.content
r = requests.get("www.amazon.in",auth=HttpNtlmAuth('user_name','passwd'))
print r.content*
s_data = BeautifulSoup(r.content,"lxml")*
print s_data.content
Error:
Document Error: Unauthorized
Access to this document requires a User IDAccess Error: Unauthorized
This is the error I get when BeautifulSoup tries to access the data after I have manually logged into the site.