
I am trying to get the text data from the following page:

https://www.lemoyne.edu/Give/Information-for-Donors/Honor-Roll/1954

I am not able to get the text, maybe because of where the span tags are located.

Any suggestions or help would be appreciated. Thanks in advance!


import time

from selenium import webdriver

driver = webdriver.Chrome("chromedriver.exe")
driver.maximize_window()
driver.get("https://www.lemoyne.edu/Give/Information-for-Donors/Honor-Roll/1954")
time.sleep(10)

donors = driver.find_elements("xpath", '//div[@class = "container"]/div[@class="donorcolumn"]/p')
donors

# Result: empty list []

for donor in donors:
    print(donor.get_attribute("innerHTML"))

# Result: nothing is printed, because donors is empty

for donor in donors:
    print(donor.text)

# Result: nothing is printed, because donors is empty

Expected output:

The Hon. Salvatore J. Arrigo Jr. '54 and Mrs. Elizabeth J. Arrigo (35) President's Club Annual Fund Previous President's Club Member
Margaret A. Dwyer '54, L.C.H.D. '94 (35) President's Club Previous President's Club Member
Frances Morrison Scott Estate (1) Previous President's Club Member
Rosemary T. Fatcheric '54 (12)
Jo-An Feyerabend '54 (35)
James H. Greiner '54 (11) Annual Fund
Charles R. Nojaim '54 and Patricia Nojaim (22)
Marie Dinehart Rathbun '54 (22)
Audrey Zillioux Rich '54 (30) Annual Fund
David G. Schoeneck '54 and Therese Sharpe Schoeneck '54 (25) Annual Fund
John H. Senecal '54 (14)
John B. Vita '54 and Mary M. Vita (1)
Eugene P. Vukelic '54 (8) President's Club Annual Fund Previous President's Club Member
Shawn

3 Answers


That data is inside an iframe, which is why your XPath finds nothing in the top-level document. If you are keen on using Selenium, you first need to switch to that iframe and then get the data from it. Here is an alternative (lighter and simpler) way to get that data, by scraping the iframe source directly:

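import requests
from bs4 import BeautifulSoup as bs

# Browser-like User-Agent, in case the server rejects the default one
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
}
# URL that the iframe on the honor-roll page loads its content from
url = 'https://s3.amazonaws.com/lemoynehonorroll/1954.html'

r = requests.get(url, headers=headers)
r.raise_for_status()  # fail early on an HTTP error instead of parsing an error page

soup = bs(r.text, 'html.parser')
# One string per donor <p>; text from nested tags is joined with single spaces
donors_list = [x.get_text(strip=True, separator=' ') for x in soup.select('div[class="donorcolumn"] p')]
print(donors_list)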

Result in terminal:

["The Hon. Salvatore J. Arrigo Jr. '54 and Mrs. Elizabeth J. Arrigo (35)",
 "Margaret A. Dwyer '54, L.C.H.D. '94 (35)",
 'Frances Morrison Scott Estate (1)',
 "Rosemary T. Fatcheric '54 (12)",
 "Jo-An Feyerabend '54 (35)",
 "James H. Greiner '54 (11)",
 "Charles R. Nojaim '54 and Patricia Nojaim (22)",
 "Marie Dinehart Rathbun '54 (22)",
 "Audrey Zillioux Rich '54 (30)",
 "David G. Schoeneck '54 and Therese Sharpe Schoeneck '54 (25)",
 "John H. Senecal '54 (14)",
 "John B. Vita '54 and Mary M. Vita (1)",
 "Eugene P. Vukelic '54 (8)"]
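
The iframe URL above was taken from the page source. As a rough sketch of how such a URL could be found programmatically, the snippet below collects the `src` of every `<iframe>` in an HTML string using only the standard library. The names `IframeCollector` and `iframe_sources` are made up for this illustration, and the `sample` markup is a hypothetical stand-in for the parent page, not a copy of it.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeCollector(HTMLParser):
    """Collect the src attribute of every <iframe> tag in a document."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names; attrs is a list of (name, value) pairs
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def iframe_sources(html, base_url):
    """Return absolute URLs for all iframes found in the HTML text."""
    parser = IframeCollector()
    parser.feed(html)
    # urljoin leaves absolute URLs untouched and resolves relative ones
    return [urljoin(base_url, src) for src in parser.sources]


# Hypothetical stand-in for the parent page's markup (the real page may differ).
sample = '''
<div class="container">
  <iframe id="dnn_ctr11646_IFrame_htmIFrame"
          src="https://s3.amazonaws.com/lemoynehonorroll/1954.html"></iframe>
</div>
'''
print(iframe_sources(sample, "https://www.lemoyne.edu/Give/Information-for-Donors/Honor-Roll/1954"))
```

In practice you would feed it the text of the parent page (for example `requests.get(...).text`) instead of `sample`. This only works if the iframe tag is present in the served HTML rather than injected later by JavaScript.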
Barry the Platipus

If you inspect the HTML, you will see that the desired elements are wrapped within an iframe. You need to switch into the frame before performing any other actions. Use the code below to switch to the iframe:

wait = WebDriverWait(driver, 10)
wait.until(EC.frame_to_be_available_and_switch_to_it((By.ID, "dnn_ctr11646_IFrame_htmIFrame")))

Full code:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.maximize_window()
driver.get("https://www.lemoyne.edu/Give/Information-for-Donors/Honor-Roll/1954")
wait = WebDriverWait(driver, 10)
wait.until(EC.frame_to_be_available_and_switch_to_it((By.ID, "dnn_ctr11646_IFrame_htmIFrame")))
donors = wait.until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class = 'container']/div[@class='donorcolumn']/p")))

for donor in donors:
    print(donor.get_attribute("innerHTML"))

for donor in donors:
    print(donor.text)
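
Once the driver has switched into a frame, it stays there until told otherwise, so later lookups on the outer page will fail unless you switch back with `driver.switch_to.default_content()`. A minimal sketch of wrapping that in a context manager follows. `inside_frame` and `donor_texts` are hypothetical helper names, the string `"css selector"` is the value behind `By.CSS_SELECTOR`, and the helper assumes the frame is already loaded because it does no waiting of its own.

```python
from contextlib import contextmanager


@contextmanager
def inside_frame(driver, by, value):
    """Switch into the iframe located by (by, value) and always switch back out."""
    frame = driver.find_element(by, value)   # locate the <iframe> element itself
    driver.switch_to.frame(frame)            # later lookups now search inside the frame
    try:
        yield driver
    finally:
        driver.switch_to.default_content()   # return to the top-level page


def donor_texts(driver):
    """Collect the text of every donor paragraph inside the honor-roll iframe."""
    # Suffix match on the iframe id, so the numeric part of the id may vary
    with inside_frame(driver, "css selector", "iframe[id$='IFrame_htmIFrame']"):
        return [p.text for p in driver.find_elements("css selector", "div.donorcolumn p")]
```

Calling `donor_texts(driver)` after `driver.get(...)` returns the list of strings and leaves the driver on the top-level document. For slow pages, the explicit wait shown in the full code is still the safer way to enter the frame.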
Shawn

The desired elements are within an <iframe>, so you have to:

  • Use WebDriverWait to wait until the desired frame is available and switch to it.

  • To extract the texts, you can use a list comprehension with either of the following locator strategies:

    • Using CSS_SELECTOR:

      driver.get("https://www.lemoyne.edu/Give/Information-for-Donors/Honor-Roll/1954")
      WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, "iframe[id$='IFrame_htmIFrame']")))
      print([my_elem.text for my_elem in driver.find_elements(By.CSS_SELECTOR, "div.donorcolumn p")])
      
    • Using XPATH:

      driver.get("https://www.lemoyne.edu/Give/Information-for-Donors/Honor-Roll/1954")
      WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, "iframe[id$='IFrame_htmIFrame']")))
      print([my_elem.text for my_elem in driver.find_elements(By.XPATH, "//div[@class='donorcolumn']//p")])
      
  • Note: You have to add the following imports:

     from selenium.webdriver.support.ui import WebDriverWait
     from selenium.webdriver.common.by import By
     from selenium.webdriver.support import expected_conditions as EC
    
  • Console Output:

    ["The Hon. Salvatore J. Arrigo Jr. '54 and Mrs. Elizabeth J. Arrigo (35)", "Margaret A. Dwyer '54, L.C.H.D. '94 (35)", 'Frances Morrison Scott Estate (1)', "Rosemary T. Fatcheric '54 (12)", "Jo-An Feyerabend '54 (35)", "James H. Greiner '54 (11)", "Charles R. Nojaim '54 and Patricia Nojaim (22)", "Marie Dinehart Rathbun '54 (22)", "Audrey Zillioux Rich '54 (30)", "David G. Schoeneck '54 and Therese Sharpe Schoeneck '54 (25)", "John H. Senecal '54 (14)", "John B. Vita '54 and Mary M. Vita (1)", "Eugene P. Vukelic '54 (8)"]
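
The output is a flat list of strings, each ending in a number in parentheses. If you need that number separately, one possible sketch is a small regular-expression helper like the one below. `split_donor` and `DONOR_PATTERN` are names invented for this example, and the code makes no assumption about what the number means.

```python
import re

# Matches "<name> (<digits>)" where the parenthesized number ends the string.
DONOR_PATTERN = re.compile(r"^(?P<name>.*?)\s*\((?P<count>\d+)\)\s*$")


def split_donor(entry):
    """Return (name, count) for one scraped string, or (entry, None) if it does not match."""
    match = DONOR_PATTERN.match(entry)
    if match is None:
        return entry, None
    return match.group("name"), int(match.group("count"))


# Two entries copied from the console output above
sample = [
    "Margaret A. Dwyer '54, L.C.H.D. '94 (35)",
    "Frances Morrison Scott Estate (1)",
]
for name, count in map(split_donor, sample):
    print(name, count)
```

Entries without a trailing parenthesized number are returned unchanged with `None` as the count, so nothing is silently dropped.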
    

undetected Selenium