
I am trying to scrape a LinkedIn jobs page and save all the company names on it into a DataFrame. However, when I run a for loop over the list elements, it keeps printing the first company name throughout the loop.

from selenium import webdriver
import os
import time
from selenium.webdriver.common.by import By
import undetected_chromedriver as uc
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
import pandas as pd

url = 'https://www.linkedin.com/jobs/search/?currentJobId=3492578215&geoId=105365761&keywords=data%20analyst&location=Nigeria&refresh=true'
options = webdriver.ChromeOptions()
options.add_experimental_option('detach',True)
driver = webdriver.Chrome(r"C:\Users\i\Desktop\PPstuff\selenium\chromedriver.exe", options=options)
driver.get(url)
jobs = driver.find_elements(By.TAG_NAME,'li')
company_name = []
for job in jobs:
    company = job.find_element(By.XPATH, "//h4").text
    company_name.append(company)
    print(company)
Dayo Salam
  • I believe that `//` means "anywhere in the document". So, it's looking from the top of the document every time, and of course it find the same company each time. – John Gordon Mar 05 '23 at 22:03
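As the comment notes, an XPath that begins with `//` searches from the document root even when it is called on an element, so every iteration matches the first `<h4>` on the page. Prefixing the path with a dot (`job.find_element(By.XPATH, ".//h4")`) scopes the search to the current element. The same scoping rule can be illustrated without a browser using the standard library's ElementTree (a minimal sketch with made-up markup, not the real LinkedIn page):

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for the job list; each <li> has its own <h4>.
html = """
<ul>
  <li><h4>Acme Corp</h4></li>
  <li><h4>Globex</h4></li>
</ul>
"""

root = ET.fromstring(html)
items = root.findall("li")

# The leading dot makes './/h4' relative to each <li>, so each lookup
# returns that item's own heading instead of the first one in the document.
names = [li.find(".//h4").text for li in items]
print(names)  # ['Acme Corp', 'Globex']
```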

3 Answers


To extract all the company names, ideally you should induce WebDriverWait for visibility_of_all_elements_located() and, using a list comprehension, you can use either of the following locator strategies:

  • Using CSS_SELECTOR and get_attribute("innerHTML"):

    print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a.job-card-list__title")))])
    
  • Using XPATH and text attribute:

    print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//a[contains(@class, 'job-card-list__title')]")))])
    
  • Note : You have to add the following imports :

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    


undetected Selenium

I found the elements by CSS selector (just my preference) and used Firefox, but Chrome should work too. I added an if condition to skip duplicates. This should work:

from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://www.linkedin.com/jobs/search/?currentJobId=3492578215&geoId=105365761&keywords=data%20analyst&location=Nigeria&refresh=true'

driver = webdriver.Firefox()
company_name = []

driver.get(url)
jobs = driver.find_elements(By.CSS_SELECTOR, ".hidden-nested-link")

for job in jobs:
    # If the company name is already in the list, skip it
    if job.text not in company_name:
        company = job.text
        company_name.append(company)
        print(company)
jwill

Try the code below; it prints the desired elements:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
url = 'https://www.linkedin.com/jobs/search/?currentJobId=3492578215&geoId=105365761&keywords=data%20analyst&location=Nigeria&refresh=true'
driver.get(url)
driver.maximize_window()

companyNames = driver.find_elements(By.XPATH, '//h4/a')
for name in companyNames:
    print(name.text)

Console Output:

CareerMatch
Turing
Data2Bots
NewGlobe
Canonical
Turing
TEDxMaitama Official
Flutterwave
CareerMatch
Mshel Homes Limited
Jobberman Nigeria
Flutterwave
Zer0Paper
CareerMatch
Renesas Electronics
Turing
KNN Corporate Services Ltd
AppCake
Canonical
Turing
CareerMatch
Turing
Verraki Africa
Shaldag Limited
Turing

Process finished with exit code 0
Shawn