I am working on a project to analyze the SuperCluster Astronaut Database, and I am trying to scrape the data for each astronaut into a clean pandas DataFrame. The list page has plenty of descriptive information about each astronaut that I can scrape. However, clicking on an astronaut opens a page with more information, including a couple of paragraphs of biography. I would like to scrape that too, so I need to automate clicking each astronaut's link and then scraping the page it routes to.
Here is my attempt at that so far:
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import time
data = []
url = 'https://www.supercluster.com/astronauts?ascending=false&limit=300&list=true&sort=launch%20order'
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(ChromeDriverManager().install(),options=options)
driver.maximize_window()
driver.get(url)
time.sleep(10)
bio_data = []
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.close()
tags = soup.select('.astronaut_cell.x')
for item in tags:
    name = item.select_one('.bau.astronaut_cell__title.bold.mr05').get_text()
    for i in name:
        btn = driver.find_element_by_css_selector('cb.super_card__link_grid').click()
        bio = item.select_one('px1.pb1').get_text()
        bio_data.append([bio])
    data.append([name, bio_data])
cols=['name','bio']
df = pd.DataFrame(data,columns=cols)
print(df)
I'm getting an error that reads:
InvalidSessionIdException: Message: invalid session id
I'm not sure how to resolve this. Can someone point me in the right direction? Any help would be appreciated!
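From what I can tell, this exception is raised when a command is sent to a browser session that has already ended, and my script calls `driver.close()` before the loop that calls `find_element_by_css_selector`. Below is a rough sketch of the direction I think this needs to go, not a tested solution: keep the driver open, collect each astronaut's link from the list page first, then visit the links one by one and read the biography from each page. The helper names (`collect_links`, `scrape_bios`, `main`) are made up for illustration, and the selectors passed in `main` (`a.super_card__link_grid`, `h1`, `.px1.pb1`) are guesses based on the class names in my attempt, so they may not match the real page.

```python
from urllib.parse import urljoin

# Locator strategy string; this is the value of selenium's By.CSS_SELECTOR.
CSS = "css selector"

LIST_URL = ("https://www.supercluster.com/astronauts"
            "?ascending=false&limit=300&list=true&sort=launch%20order")


def collect_links(driver, list_url, link_selector):
    """Open the list page and return the absolute URL of every matching link."""
    driver.get(list_url)
    anchors = driver.find_elements(CSS, link_selector)
    hrefs = [a.get_attribute("href") for a in anchors]
    # Skip anchors without an href; urljoin leaves absolute URLs unchanged.
    return [urljoin(list_url, h) for h in hrefs if h]


def scrape_bios(driver, links, name_selector, bio_selector):
    """Visit each link with the still-open driver and collect name and bio."""
    rows = []
    for link in links:
        driver.get(link)
        name = driver.find_element(CSS, name_selector).text
        paragraphs = driver.find_elements(CSS, bio_selector)
        bio = "\n".join(p.text for p in paragraphs)
        rows.append({"name": name, "bio": bio})
    return rows


def main():
    # Imported here so the helpers above can be exercised without a browser.
    import pandas as pd
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    driver.implicitly_wait(10)  # let the JavaScript-rendered elements appear
    try:
        # Selectors below are assumptions, not verified against the site.
        links = collect_links(driver, LIST_URL, "a.super_card__link_grid")
        rows = scrape_bios(driver, links, "h1", ".px1.pb1")
    finally:
        driver.quit()  # end the session only after all scraping is done
    print(pd.DataFrame(rows, columns=["name", "bio"]))
```

Calling `main()` would run the whole flow. The main difference from my attempt is that the session is only ended in the `finally` block, after every page has been read, and that the pages are reached with `driver.get(link)` instead of clicking elements found in an already-parsed BeautifulSoup snapshot.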