
I am trying to scrape a record table on familysearch.org. I am using the Chrome WebDriver with Python, together with BeautifulSoup and Selenium.

Upon inspecting the page I am interested in, I wanted to scrape the following bit of HTML. Note this is only one element of a familysearch.org table that has 100 names.

<span role="cell" class="td " name="name" aria-label="Name"> <dom-if style="display: none;"><template is="dom-if"></template></dom-if> <dom-if style="display: none;"><template is="dom-if"></template></dom-if> <span><sr-cell-name name="Jame Junior " url="ZS" relationship="Principal" collection-name="Index"></sr-cell-name></span> <dom-if style="display: none;"><template is="dom-if"></template></dom-if> </span>

Alternatively, the name also appears in this bit of HTML:

<a class="name" href="/ark:ZS">Jame Junior </a>

From all of this, I only want to get the name "Jame Junior". I have tried using driver.find_elements_by_class_name("name"), but it prints nothing.

This is the code I used:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import pandas as pd
from getpass import getpass


username = input("Enter Username: ")
password = input("Enter Password: ")
chrome_path = r"C:\Users...chromedriver_win32\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get("https://www.familysearch.org/search/record/results?q.birthLikeDate.from=1996&q.birthLikeDate.to=1996&f.collectionId=...")

usernamet = driver.find_element_by_id("userName")
usernamet.send_keys(username)
passwordt = driver.find_element_by_id("password")
passwordt.send_keys(password)
login = driver.find_element_by_id("login")
login.submit()
driver.get("https://www.familysearch.org/search/record/results?q.birthLikeDate.from=1996&q.birthLikeDate.to=1996&f.collectionId=.....")
WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CLASS_NAME, "name")))
#for tag in driver.find_elements_by_class_name("name"):
 #   print(tag.get_attribute('innerHTML'))

for tag in soup.find_all("sr-cell-name"):
    print(tag["name"])
rlearner
  • Do you want all the names that follow this format? or specifically only the name "Jame Junior" out of the entire page? – MendelG Jul 15 '21 at 20:58
  • Yes, I would like all of the names – rlearner Jul 15 '21 at 21:11
  • My answer includes a solution for _all_ the names, did it work? – MendelG Jul 15 '21 at 21:15
  • It did not work unfortunately, for your first suggestion using Selenium, it only says process finished with exit code 0. For the second solution it tells me that soup is undefined. I tried defining soup=(), but that did not work either. – rlearner Jul 15 '21 at 21:18
  • See my edited answer. That should solve it – MendelG Jul 15 '21 at 21:21

2 Answers


Try accessing the sr-cell-name tag instead.

Selenium:

for tag in driver.find_elements_by_tag_name("sr-cell-name"):
    print(tag.get_attribute("name"))

BeautifulSoup:

for tag in soup.find_all("sr-cell-name"):
    print(tag["name"])
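For the BeautifulSoup version, soup has to be defined first from the rendered page source, e.g. soup = BeautifulSoup(driver.page_source, "html.parser"). As a quick check that the loop itself does what you want, here is the same extraction run against a static copy of the HTML fragment from the question (no browser needed):

```python
from bs4 import BeautifulSoup

# Static copy of the fragment from the question, used here instead of
# driver.page_source so the snippet runs without a browser.
html = '''
<span role="cell" class="td " name="name" aria-label="Name">
  <span><sr-cell-name name="Jame Junior " url="ZS" relationship="Principal"
        collection-name="Index"></sr-cell-name></span>
</span>
'''

soup = BeautifulSoup(html, "html.parser")

# The "name" attribute carries a trailing space in the source, so strip it.
names = [tag["name"].strip() for tag in soup.find_all("sr-cell-name")]
print(names)  # ['Jame Junior']
```

Against the live page, replace the html string with driver.page_source after the WebDriverWait has completed.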

EDIT: You might need to wait for the element to fully appear on the page before parsing it. You can do this with WebDriverWait and the presence_of_element_located expected condition:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


driver = webdriver.Chrome()
driver.get("...")

WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CLASS_NAME, "name")))

for tag in driver.find_elements_by_class_name("name"):
    print(tag.get_attribute('innerHTML'))
MendelG
  • It did ran, but it gave my the following message : Traceback (most recent call last): "File "C:\Users....py", line 22, in WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CLASS_NAME, "name"))) File "C:\Users...", line 80, in until raise TimeoutException(message, screen, stacktrace) selenium.common.exceptions.TimeoutException: Message:" I still was not able to print anything – rlearner Jul 15 '21 at 21:30
  • @rlearner Please [edit] your question to show us the full code you have tried – MendelG Jul 15 '21 at 21:38
  • Just updated the question to include the full code @MendelG – rlearner Jul 17 '21 at 03:45

I was looking to do something very similar and have semi-decent Python/Selenium scraping experience. Long story short, FamilySearch (and many other sites, I'm sure) uses a technology (I'm not a JS or web guy) called the shadow DOM. The tags inside a shadow root are essentially invisible to BeautifulSoup and to Selenium's normal locators.

Solution: pyshadow https://github.com/sukgu/pyshadow

You may also find this link helpful: How to handle elements inside Shadow DOM from Selenium

I have now been able to successfully find elements I couldn't before, but I am still not all the way to where I'm trying to get. Good luck!