I am trying to learn web scraping. Even though I checked the examples in the documentation and some questions here on Stack Overflow, I cannot make my code work.

The website I want to scrape has job listings, but there is no pattern or fixed set of classes in its structure; almost every element has its own id and individual classes. When I use the inspector to find the XPath of an anchor tag whose innerHTML I want, this is what I get:

With Firefox:

/html/body/div[1]/div/main/div[3]/div/div/section/ul/li[1]/article/header/div/div[1]/h2/a

With Brave Browser:

//*[@id="16542952"]/section/div/header/h2/a

Same URL and same element: the first job title in the results.

URL

I want to loop through the page and get the text from some elements in the job listings, such as the job title, description, etc.

I am using Selenium with Python and Firefox/geckodriver.

undetected Selenium
Paulo Barros
    I have checked in Firefox and it gives the same XPath. In Brave, select "Copy full XPath" when copying the XPath. – deadshot May 18 '20 at 20:10

2 Answers

To loop through the page and get the text of the job listings using Selenium and Python, you have to use WebDriverWait with visibility_of_all_elements_located() so that the elements are rendered before you read them. You can use either of the following locator strategies:

  • Using CSS_SELECTOR and get_attribute():

    driver.get('https://www.catho.com.br/vagas/data-scientist/?q=data%20scientist&page=1')
    print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "header>h2>a")))])
    
  • Using XPATH and text attribute:

    driver.get('https://www.catho.com.br/vagas/data-scientist/?q=data%20scientist&page=1')
    print([my_elem.text for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//header/h2/a")))])
    
  • Console Output:

    ['Analista Data Science', 'Consultor de Data Science', 'Analista Big Data / Cientista de Dados', 'Cientista de Dados', 'Cientista de Dados', 'Cientista de Dados', 'Cientista de Dados', 'Cientista de Dados', 'Cientista de Dados', 'Cientista de Dados']
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
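
The difference between the copied XPaths and the short `//header/h2/a` locator can be tried offline. The following is only a minimal sketch using Python's standard library, not Selenium: the markup is a hypothetical, heavily simplified stand-in for the real page, and the second id is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page is far more deeply nested.
SAMPLE = """
<ul>
  <li id="16542952"><article><header><h2><a>Analista Data Science</a></h2></header></article></li>
  <li id="16542953"><article><header><h2><a>Cientista de Dados</a></h2></header></article></li>
</ul>
"""

root = ET.fromstring(SAMPLE)

# Positional path (what "copy full XPath" style gives you): pins one element.
first_by_position = root.find("./li[1]/article/header/h2/a")

# id-based path (what the short "copy XPath" style gives you): also one element.
first_by_id = root.find(".//*[@id='16542952']/article/header/h2/a")

# Structural path without positions or ids: matches every job title.
all_titles = [a.text for a in root.findall(".//header/h2/a")]

print(first_by_position is first_by_id)  # both paths reach the same node
print(all_titles)
```

Both copied paths are valid ways to reach the same single node, which is why the two browsers disagree without either being wrong. For looping, a path that drops the positions and ids is the one that generalizes.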
    
undetected Selenium

Once you have an element el, you can get its innerHTML, for example, like this:

el = driver.find_element('xpath', 'FULL XPATH (which FireFox gave you)')
el.get_property("innerHTML")

As for the loop, I think you could start from the parent element that holds the job elements:

parent = driver.find_element('xpath', '/html/body/div[1]/article/section/ul') # the 'ul' which holds the jobs 'li' tags
jobs = driver.execute_script("return arguments[0].children", parent) # the parent variable will be replacing arguments[0]

for job in jobs:
    # do what you want with each element, e.g. print its visible text
    print(job.text)
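
As a possible variation on the loop above, the children can also be collected with a relative XPath instead of JavaScript. This is a rough sketch: `collect_job_titles` is a hypothetical helper, and the inner path `.//header//h2/a` is an assumption based on the XPaths in the question, so it may need adjusting for the real page. The string `"xpath"` is the same locator value used in the snippets above.

```python
def collect_job_titles(parent):
    """Return the title text of every job listed under `parent`.

    `parent` is expected to be the element that holds the job <li> tags,
    e.g. the result of driver.find_element('xpath', ...).
    """
    titles = []
    # "./li" selects only the direct <li> children of the parent element.
    for job in parent.find_elements("xpath", "./li"):
        # The leading "." keeps the search inside this one job element.
        # Assumed structure: the title link sits under header ... h2.
        link = job.find_element("xpath", ".//header//h2/a")
        titles.append(link.text.strip())
    return titles
```

With the `parent` variable from above, `print(collect_job_titles(parent))` would print one title per job. Starting each inner XPath with `.` matters: a path beginning with `//` searches the whole document again, so every iteration would return the first job on the page.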
n.qber