I am working on data scraping and machine learning, and I am new to both Python and scraping. I am trying to scrape this particular site.
From what I have observed, the site executes several scripts between the login and the next page, and those scripts are what fetch the table data. I can log in successfully and then fetch the next page with the same session; what I am missing is the data that those in-between scripts load. I need the data from the table
satcat
and to achieve pagination. The following is my code:
import time
from requests_html import HTMLSession

url = 'https://www.space-track.org/'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:66.0) Gecko/20100101 Firefox/66.0'}

# credentials intentionally left blank here
login_data = {
    "identity": "",
    "password": "",
    "btnLogin": "LOGIN",
}

session = HTMLSession()

# GET the login page first so the server sets the CSRF cookie
preLogin = session.get(url + 'auth/login', headers=headers)
time.sleep(3)

# the login form expects the CSRF cookie value echoed back as a form field
csrf = session.cookies.get('spacetrack_csrf_cookie')
login_data['spacetrack_csrf_token'] = csrf

login = session.post(url + 'auth/login', data=login_data,
                     headers=headers, allow_redirects=True)
time.sleep(1)
print('login landed on:', login.url)

# fetch the post-login page and render the JavaScript it runs
time.sleep(3)
postLogin = session.get(url)
postLogin.html.render(sleep=5, keep_page=True)
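For future readers: once `render()` has finished, `postLogin.html.html` holds the rendered page source, which can be parsed with BeautifulSoup. A minimal sketch against made-up markup — the `table#satcat` id and the column layout are assumptions, so check the real markup in the browser inspector:

```python
from bs4 import BeautifulSoup

# Stand-in for the rendered page source (postLogin.html.html after
# render()); this markup is made up purely to illustrate the parsing calls.
sample = """
<table id="satcat">
  <tr><th>NORAD_CAT_ID</th><th>SATNAME</th></tr>
  <tr><td>25544</td><td>ISS (ZARYA)</td></tr>
</table>
"""

soup = BeautifulSoup(sample, 'html.parser')
rows = []
for tr in soup.select('table#satcat tr'):
    # collect the text of every header/data cell in this row
    rows.append([cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])])

print(rows)
```

The same `select`/`find_all` calls would apply to the real rendered source in place of `sample`.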
As you can see, I have used the requests_html library to render the HTML, but I have been unsuccessful in getting the data. This is the URL that the JavaScript requests internally to fetch my data:
Can anyone help me with how to scrape that data, or the JavaScript that loads it?
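For what it's worth, rather than rendering the JavaScript at all, it may be simpler to call the endpoint the script hits directly with the authenticated session. Space-Track exposes a REST query API, and a satcat request can be paginated with a limit/<count>,<offset> predicate. A minimal sketch of building those URLs — the exact path and predicates are assumptions based on the public API docs, so confirm the real request in the browser's network tab:

```python
def satcat_page_url(page_size, offset):
    """Build one page of a satcat query against Space-Track's query API.

    The path and the limit/<count>,<offset> pagination predicate are
    assumptions; verify them against the request the browser actually makes.
    """
    base = 'https://www.space-track.org/basicspacedata/query/class/satcat'
    return f'{base}/orderby/NORAD_CAT_ID/limit/{page_size},{offset}/format/json'

print(satcat_page_url(100, 0))    # first page of 100 rows
print(satcat_page_url(100, 100))  # second page
```

With the session from the code above already logged in, `session.get(satcat_page_url(100, 0)).json()` would return the first page, and stepping `offset` by `page_size` until an empty list comes back walks every page.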
Thank you :)