
I am working on data scraping and machine learning, and I am new to both Python and scraping. I am trying to scrape this particular site:

https://www.space-track.org/

From what I have observed in the browser, the site runs several scripts between the login and the next page, and those scripts are what load the table data. I am able to log in successfully and, with the same session, fetch the next page as well; what I am missing is the data that those in-between scripts load. I need the data from the table

satcat

and to handle its pagination. The following is my code:

    import time

    from requests_html import HTMLSession

    url = 'https://www.space-track.org/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:66.0) Gecko/20100101 Firefox/66.0'
    }
    login_data = {
        'identity': '',      # username goes here
        'password': '',      # password goes here
        'btnLogin': 'LOGIN',
    }

    session = HTMLSession()

    # Load the login page first so the CSRF cookie gets set on the session.
    preLogin = session.get(url + 'auth/login', headers=headers)
    time.sleep(3)

    # The login form expects the CSRF token as a field; its value matches the cookie.
    # (I also tried reading the token from the hidden form input with BeautifulSoup,
    # with the same result.)
    csrf = session.cookies.get('spacetrack_csrf_cookie')
    login_data['spacetrack_csrf_token'] = csrf

    # Submit the login form.
    login = session.post(url + 'auth/login', data=login_data, headers=headers,
                         allow_redirects=True)
    time.sleep(1)
    print('login landed on:', login.url)

    # Fetch the post-login page and render the JavaScript that is supposed to
    # load the satcat table.  (I also tried AsyncHTMLSession with arender(),
    # with no luck.)
    time.sleep(3)
    postLogin = session.get(url)
    postLogin.html.render(sleep=5, keep_page=True)

As you can see, I have used the requests_html library to render the HTML, but I have been unsuccessful in getting the data. This is the URL that the page's JavaScript calls internally to fetch the data I need:

https://www.space-track.org/master/loadSatCatData

Can anyone help me with how to scrape that data or execute that JavaScript?
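
In case it helps, this is roughly the kind of direct call I imagine making against that endpoint with the logged-in session from the code above. It is completely untested; I do not know whether the endpoint expects a GET or a POST, or which parameters it needs:

    # Rough, untested sketch: reuse the logged-in `session`, `url` and `headers`
    # from the code above and call the endpoint the page's JavaScript uses.
    # The request method and any required parameters are guesses on my part.
    dataUrl = url + 'master/loadSatCatData'
    resp = session.post(dataUrl, headers=headers)
    print(resp.status_code)
    print(resp.text[:500])   # inspect the response to see whether it is JSON or HTML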

Thank you :)

Pritish

1 Answer


You can go for Selenium. It has a method, `browser.execute_script()`, which lets you run JavaScript inside the page. Hope this helps :)
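
A minimal sketch of that approach (assuming Selenium 3 with chromedriver available on your PATH; the form field names come from your own `login_data`, while the `#satcat` selector is only a guess that you would need to adjust to the real markup):

    import time

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    options = Options()
    options.add_argument('--headless')   # run without opening a browser window

    browser = webdriver.Chrome(options=options)
    try:
        browser.get('https://www.space-track.org/auth/login')

        # Fill in and submit the login form (field names taken from your login_data).
        browser.find_element(By.NAME, 'identity').send_keys('YOUR_USERNAME')
        browser.find_element(By.NAME, 'password').send_keys('YOUR_PASSWORD')
        browser.find_element(By.NAME, 'btnLogin').click()

        # Give the post-login page and its JavaScript time to load the table.
        time.sleep(5)

        # execute_script() runs arbitrary JavaScript in the page, e.g. to read
        # values that the page's own scripts have already populated.
        row_count = browser.execute_script(
            "return document.querySelectorAll('#satcat tbody tr').length;"
        )
        print('rows rendered:', row_count)

        # The fully rendered HTML can then be parsed with BeautifulSoup as usual.
        soup = BeautifulSoup(browser.page_source, 'html.parser')
        table = soup.find('table', id='satcat')
        print(table is not None)
    finally:
        browser.quit()

Headless mode keeps this usable from a cron job or a server without a display.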

Debdut Goswami
  • Does it execute all the scripts? – Pritish Dec 06 '19 at 17:22
  • Yeah, I guess so. I have never encountered any error. I haven't read the documentation, tbh; I am suggesting this from personal experience. – Debdut Goswami Dec 06 '19 at 17:22
  • Thank you. My concern with Selenium is: if I want to ship this code as a script to someone else, how will I ship the driver? Can you throw some light on it? – Pritish Dec 06 '19 at 17:24
  • You can easily add the ChromeDriver binary to the working directory and ship the directory (see the sketch after these comments). And if your concern is about production, you can convert the Python script to an executable using `pyinstaller`. – Debdut Goswami Dec 06 '19 at 17:26
  • That is great, thank you, will try and let you know :) – Pritish Dec 06 '19 at 17:27
  • Sure. And do comment here if you encounter any error with `browser.execute_script()` – Debdut Goswami Dec 06 '19 at 17:28
  • Sure, will try it out tomorrow, including installation. Thank you for the response. – Pritish Dec 06 '19 at 17:30
  • Can you please tell me how to pass the session from Scrapy to Selenium, as I am logging in with Selenium? – Pritish Dec 07 '19 at 21:16
  • Why are you using Scrapy at all? The combination of Selenium and Beautiful Soup should do the job. – Debdut Goswami Dec 08 '19 at 05:23
  • I am fairly new to it; I was into Android before and don't have experience even in building websites :) so it's all Greek and Latin to me :) Can you please give me some concrete example, like a link to an example with login and scraping after it? For instance, I found out that Selenium requires a specific version of Chrome to be installed, and it launches the browser, so how do I scale it as a project when I give my Python script as an executable or something like that? This script is to be run every day at a certain time, so will it launch the browser, and how will it run without a UI? – Pritish Dec 08 '19 at 05:31
  • You can make Chrome headless, i.e. it won't open the browser window. I'm providing a good link for you to get started. – Debdut Goswami Dec 08 '19 at 07:09
  • “Web Scraping using Beautiful Soup and Selenium for dynamic page” by Rahul Nayak https://link.medium.com/5xFJYy9gf2 This will give you a good head start on when to use Selenium, when to use Beautiful Soup, and when to use both. – Debdut Goswami Dec 08 '19 at 07:11
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/203838/discussion-between-pritish-and-debdut-goswami). – Pritish Dec 08 '19 at 10:32
  • Hi, I am able to do it now, so can you please let me know how to run this on a server? Like, how do I ship the Chrome binary? Is it required to be there on the server? – Pritish Dec 11 '19 at 06:50
  • Use `pyinstaller` to package the file along with the chrome driver. – Debdut Goswami Dec 11 '19 at 07:49
  • ChromeDriver is there, but what about the binary? That is something I get only after I install Chrome on my Linux machine, so how do I manage that? Do I need to install Chrome on the server, of the exact same version? And what if I want to host it on a server? I have asked a separate question for it as well, if you want to answer: https://stackoverflow.com/questions/59279888/how-to-create-an-application-of-python-selenium-with-its-drivers-shipped :) – Pritish Dec 11 '19 at 08:26
  • I won't comment on that now. I'll try it first and then I'll let you know. – Debdut Goswami Dec 11 '19 at 12:40
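
Regarding shipping the driver, as mentioned in the comments: a minimal sketch of loading a chromedriver binary that sits next to the script, so it can be shipped alongside the code (untested here; this uses the Selenium 3-style `executable_path` argument, while Selenium 4 would wrap the path in a `Service` object instead):

    # Minimal sketch: point Selenium at a chromedriver binary placed in the same
    # directory as this script, so the driver can be shipped with the code.
    import os

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    driver_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'chromedriver')

    options = Options()
    options.add_argument('--headless')   # no browser window, so it can run on a server

    browser = webdriver.Chrome(executable_path=driver_path, options=options)
    browser.get('https://www.space-track.org/')
    print(browser.title)
    browser.quit()

Note that ChromeDriver only drives the browser; a matching Chrome/Chromium installation still has to be present on the machine where the script runs.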