
I am working on a project to analyze the SuperCluster Astronaut Database. I am trying to scrape the data for each astronaut into a nice, clean pandas DataFrame. There is plenty of descriptive information about each astronaut available on the list page. However, when you click on an astronaut, more information is revealed: a couple of paragraphs of their biography. I would like to scrape that too, so I need to automate clicking each astronaut's link and then scraping the page I am routed to.

Here is my attempt at that so far:

from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import time



data = []
url = 'https://www.supercluster.com/astronauts?ascending=false&limit=300&list=true&sort=launch%20order'

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(ChromeDriverManager().install(),options=options)
driver.maximize_window()
driver.get(url)
time.sleep(10)

bio_data = []

soup = BeautifulSoup(driver.page_source, 'lxml')
driver.close()
tags = soup.select('.astronaut_cell.x')

for item in tags:
    name = item.select_one('.bau.astronaut_cell__title.bold.mr05').get_text()
    for i in name:
        btn = driver.find_element_by_css_selector('cb.super_card__link_grid').click()
        bio = item.select_one('px1.pb1').get_text()
        bio_data.append([bio])
        
    data.append([name,bio_data])



cols=['name','bio']
df = pd.DataFrame(data,columns=cols)

print(df)

I'm getting an error that reads:

InvalidSessionIdException: Message: invalid session id

Not sure how to resolve this issue. Can someone help point me in the right direction? Any help would be appreciated!

user2813606

2 Answers


InvalidSessionIdException

InvalidSessionIdException occurs when the given session id is not in the list of active sessions, which indicates the session either does not exist or is no longer active.
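
That is exactly the situation the question's script creates: driver.close() is called right after the page source is captured, but the loop afterwards still issues driver.find_element_by_css_selector(...) against the closed session. A minimal sketch of that pattern:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.example.com')
driver.close()  # closing the only window terminates the session

# any further command now targets a dead session and typically
# raises InvalidSessionIdException
driver.find_element_by_css_selector('a')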


This use case

Possibly the Selenium-driven, ChromeDriver-initiated browsing context is getting detected as a bot and the session is getting terminated.
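
A common first mitigation (a sketch, not a guarantee; sites differ in how they detect automation) is to suppress the most obvious ChromeDriver fingerprints through ChromeOptions:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

options = Options()
options.add_argument("--headless")
# hide navigator.webdriver and the "Chrome is being controlled" infobar
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)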



undetected Selenium
  • Thanks for this info @undetected Selenium. I'm not sure that's the case, since when I comment out the bio_data section I get the names scraped. I just need help figuring out how to get the bios from the clicked pages. – user2813606 Apr 07 '22 at 19:55

Every link leads to an individual page containing that astronaut's bio data. So no clicking is needed: collect each astronaut's URL from the list page, then request each individual page and scrape the bio from there. The script below gathers the names and URLs in one pass and reuses a single browser instance for all the detail pages.

Script:

from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import time


url = 'https://www.supercluster.com/astronauts?ascending=false&limit=300&list=true&sort=launch%20order'

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)
driver.maximize_window()

# load the list page once and collect every astronaut name and detail-page URL
driver.get(url)
time.sleep(5)
soup = BeautifulSoup(driver.page_source, 'lxml')

Name = [tag.text for tag in soup.select('.bau.astronaut_cell__title.bold.mr05')]
urls = ['https://www.supercluster.com' + a.get('href')
        for a in soup.select('a[class="astronaut_cell x"]')]

# visit each detail page with the same browser instance and scrape the bio fields
bio = []
for abs_url in urls:
    print(abs_url)
    driver.get(abs_url)
    time.sleep(5)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for astro in soup.select('div.h4')[0:8]:
        bio.append(astro.text)

driver.quit()

df = pd.DataFrame(data=list(zip(Name, bio)), columns=['name', 'bio'])
print(df)

Output:

      name                                                    bio
0        Nield, George                                    b. Jul 31, 1950
1         Kitchen, Jim                                              Human
2            Lai, Gary                                               Male
3          Hagle, Marc            President Commercial Space Technologies
4        Hagle, Sharon                                    b. Jul 31, 1950
..                 ...                                                ...
295  Wilcutt, Terrence                           Lead Operations Engineer
296    Linenger, Jerry                                     b. Oct 1, 1975
297      Mukai, Chiaki                                              Human
298     Thomas, Donald                                               Male
299       Chiao, Leroy  People's Liberation Army Air Force Data Missin...

[300 rows x 2 columns]
Md. Fazlul Hoque
    This is fantastic! I changed the h4 reference to div.px1.py2.container--xl.mxa and got exactly what I needed. Everything else was spot on! Thank you! – user2813606 Apr 08 '22 at 20:06
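
For reference, the change described in this comment would look roughly like the following inside the per-page loop (the selector comes from the comment; the surrounding code is an untested sketch):

# instead of the eight div.h4 fields, grab the full bio container
bio_block = soup.select_one('div.px1.py2.container--xl.mxa')
if bio_block:
    bio.append(bio_block.get_text(strip=True))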