
I am working on a project to analyze the SuperCluster Astronaut Database. I am trying to scrape the data for each astronaut into a nice, clean pandas DataFrame. There is plenty of descriptive information about each astronaut available on the list page. However, when you click on an astronaut, more information is revealed: a couple of paragraphs of their biography. I would like to scrape that too, so I need to automate clicking each astronaut's link and then scraping the page I am routed to.

Here is my attempt at that so far:

from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import time



data = []
url = 'https://www.supercluster.com/astronauts?ascending=false&limit=300&list=true&sort=launch%20order'

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(ChromeDriverManager().install(),options=options)
driver.maximize_window()
driver.get(url)
time.sleep(10)

bio_data = []

soup = BeautifulSoup(driver.page_source, 'lxml')
driver.close()
tags = soup.select('.astronaut_cell.x')

for item in tags:
    name = item.select_one('.bau.astronaut_cell__title.bold.mr05').get_text()
    for i in name:
        btn = driver.find_element_by_css_selector('cb.super_card__link_grid').click()
        bio = item.select_one('px1.pb1').get_text()
        bio_data.append([bio])
        
    data.append([name,bio_data])



cols=['name','bio']
df = pd.DataFrame(data,columns=cols)

print(df)

I'm getting an error that reads:

InvalidSessionIdException: Message: invalid session id

Not sure how to resolve this issue. Can someone help point me in the right direction? Any help would be appreciated!

user2813606

2 Answers


InvalidSessionIdException

InvalidSessionIdException occurs when the given session id is not in the list of active sessions, which indicates the session either does not exist or is no longer active.
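
That is exactly the situation the question's script creates: driver.close() is called right after the page source is captured, but the loop afterwards still issues driver.find_element_by_css_selector(...) against the closed session. A minimal sketch of that pattern:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.example.com')
driver.close()  # closing the only window terminates the session

# any further command now targets a dead session and typically
# raises InvalidSessionIdException
driver.find_element_by_css_selector('a')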


This use case

Possibly the Selenium-driven, ChromeDriver-initiated browsing context is getting detected as a bot and the session is getting terminated.
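
A common first mitigation (a sketch, not a guarantee; sites differ in how they detect automation) is to suppress the most obvious ChromeDriver fingerprints through ChromeOptions:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

options = Options()
options.add_argument("--headless")
# hide navigator.webdriver and the "Chrome is being controlled" infobar
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)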



undetected Selenium
  • Thanks for this info @undetected Selenium. I'm not sure that's the case, since when I comment out the bio_data section I get the names scraped. I just need help figuring out how to get the bios from the clicked pages. – user2813606 Apr 07 '22 at 19:55

Every link leads to an individual page containing that astronaut's bio data. So no clicking is needed: collect each astronaut's URL from the list page, then request each individual page and scrape the bio from there. The script below gathers the names and URLs in one pass and reuses a single browser instance for all the detail pages.

Script:

from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import time


url = 'https://www.supercluster.com/astronauts?ascending=false&limit=300&list=true&sort=launch%20order'

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)
driver.maximize_window()

# load the list page once and collect every astronaut name and detail-page URL
driver.get(url)
time.sleep(5)
soup = BeautifulSoup(driver.page_source, 'lxml')

Name = [tag.text for tag in soup.select('.bau.astronaut_cell__title.bold.mr05')]
urls = ['https://www.supercluster.com' + a.get('href')
        for a in soup.select('a[class="astronaut_cell x"]')]

# visit each detail page with the same browser instance and scrape the bio fields
bio = []
for abs_url in urls:
    print(abs_url)
    driver.get(abs_url)
    time.sleep(5)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for astro in soup.select('div.h4')[0:8]:
        bio.append(astro.text)

driver.quit()

df = pd.DataFrame(data=list(zip(Name, bio)), columns=['name', 'bio'])
print(df)

Output:

      name                                                    bio
0        Nield, George                                    b. Jul 31, 1950
1         Kitchen, Jim                                              Human
2            Lai, Gary                                               Male
3          Hagle, Marc            President Commercial Space Technologies
4        Hagle, Sharon                                    b. Jul 31, 1950
..                 ...                                                ...
295  Wilcutt, Terrence                           Lead Operations Engineer
296    Linenger, Jerry                                     b. Oct 1, 1975
297      Mukai, Chiaki                                              Human
298     Thomas, Donald                                               Male
299       Chiao, Leroy  People's Liberation Army Air Force Data Missin...

[300 rows x 2 columns]
Md. Fazlul Hoque
    This is fantastic! I changed the h4 reference to div.px1.py2.container--xl.mxa and got exactly what I needed. Everything else was spot on! Thank you! – user2813606 Apr 08 '22 at 20:06
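
For reference, the change described in this comment would look roughly like the following inside the per-page loop (the selector comes from the comment; the surrounding code is an untested sketch):

# instead of the eight div.h4 fields, grab the full bio container
bio_block = soup.select_one('div.px1.py2.container--xl.mxa')
if bio_block:
    bio.append(bio_block.get_text(strip=True))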