Get innerHTML with xpath in selenium with python

Question

I am trying to learn web scraping, but even though I checked the examples in the documentation and some questions here on Stack Overflow, I cannot make my code work.

The website I want to scrape has job listings, but there is no pattern or fixed set of classes in its structure; almost every element has its own id and individual classes. When I use the inspector to copy the XPath of an anchor tag whose innerHTML I want, this is what I get:

With Firefox:

/html/body/div[1]/div/main/div[3]/div/div/section/ul/li[1]/article/header/div/div[1]/h2/a

With Brave Browser:

//*[@id="16542952"]/section/div/header/h2/a

Same URL and same element: the first job title in the results.

URL

I want to loop through the page and get the text from some elements in the job listings, such as the job title, description, etc.

I am using Selenium with Python and Firefox/geckodriver.

I checked in Firefox and it gives the same XPath. In Brave, select "Copy full XPath" when copying the XPath. - deadshot, May 18 '20 at 20:10
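
To illustrate why two different XPaths can point at the same element, here is a minimal sketch using only Python's standard library. The markup is hypothetical and heavily simplified (it is not the real page), and ElementTree supports only a subset of XPath, so the positional path is written relative to the root element rather than starting with /html.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified markup; the real page is far larger.
PAGE = """
<html>
  <body>
    <div>
      <ul>
        <li id="16542952"><header><h2><a href="#">Analista Data Science</a></h2></header></li>
        <li id="16542953"><header><h2><a href="#">Cientista de Dados</a></h2></header></li>
      </ul>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE.strip())

# Positional path that walks down from the document root, in the style of a
# "full" XPath. ElementTree paths are relative to `root` (the <html> element),
# hence "./body" instead of "/html/body".
by_position = root.find("./body/div[1]/ul/li[1]/header/h2/a")

# Path anchored on a unique id, in the style of the shorter XPath.
by_id = root.find(".//*[@id='16542952']/header/h2/a")

print(by_position is by_id)  # prints True: both paths reach the same node
print(by_position.text)      # prints Analista Data Science
```

The positional form breaks as soon as the layout changes, and the id-anchored form breaks if the ids are generated per listing, which may be why neither is a good basis for looping over all results.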

score 0 · Answer 1 · answered May 18 '20 at 20:21

To loop through the page and get the text of the job listings using Selenium and Python, wait for the elements with WebDriverWait and visibility_of_all_elements_located(), then use either of the following locator strategies:

Using CSS_SELECTOR and get_attribute():

driver.get('https://www.catho.com.br/vagas/data-scientist/?q=data%20scientist&page=1')
print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "header>h2>a")))])

Using XPATH and text attribute:

driver.get('https://www.catho.com.br/vagas/data-scientist/?q=data%20scientist&page=1')
print([my_elem.text for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//header/h2/a")))])

Console Output:

['Analista Data Science', 'Consultor de Data Science', 'Analista Big Data / Cientista de Dados', 'Cientista de Dados', 'Cientista de Dados', 'Cientista de Dados', 'Cientista de Dados', 'Cientista de Dados', 'Cientista de Dados', 'Cientista de Dados']

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

score 0 · Answer 2 · answered May 18 '20 at 20:22

Once you have an element el, you can get its innerHTML like this:

el = driver.find_element('xpath', 'FULL XPATH (which Firefox gave you)')
inner_html = el.get_property("innerHTML")  # keep the result so it can be used later

As for the loop, I think you could locate the parent element that holds the job elements and iterate over its children:

parent = driver.find_element('xpath', '/html/body/div[1]/article/section/ul')  # the 'ul' that holds the job 'li' tags
jobs = driver.execute_script("return arguments[0].children", parent)  # 'parent' is passed to the script as arguments[0]

for job in jobs:
    # do what you want to do with each element, for example:
    print(job.get_attribute("innerHTML"))
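
A variant of this loop might search inside each listing with a relative XPath instead of running JavaScript. This is only a sketch: the two XPath constants are guesses based on the paths quoted in the question, and collect_job_titles is a hypothetical helper name. It passes the locator strategy as the string "xpath", as in the snippets above, so the function itself needs no Selenium imports.

```python
from typing import Any, Dict, List

# Assumed locators, inferred from the XPaths quoted in the question;
# the real page structure may differ.
JOB_ITEMS_XPATH = "//section/ul/li"
TITLE_XPATH = ".//h2/a"  # the leading "." keeps the search inside one listing


def collect_job_titles(driver: Any) -> List[Dict[str, str]]:
    """Return the visible text and innerHTML of each listing's title link.

    `driver` is expected to be a Selenium WebDriver, but any object with a
    compatible `find_elements(by, value)` method works.
    """
    results = []
    for job in driver.find_elements("xpath", JOB_ITEMS_XPATH):
        links = job.find_elements("xpath", TITLE_XPATH)
        if not links:
            continue  # skip list items that have no title link
        link = links[0]
        results.append({
            "text": link.text,
            "inner_html": link.get_attribute("innerHTML"),
        })
    return results
```

With a real session this would be called as collect_job_titles(driver) after driver.get(...) has loaded the results page. Since the listings may be rendered late, an explicit wait such as the one in the first answer would probably be needed before calling it.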
