
I'm trying to get a Python scraper working, but I can't, and a little help would be useful since I'm still a beginner. The code runs without raising any error, but it exports only a single job (which seems to be random) to my CSV. Could someone with more experience give me some tips? Thanks in advance.

from selenium import webdriver
import pandas as pd 
from bs4 import BeautifulSoup

options = webdriver.FirefoxOptions()
driver = webdriver.Firefox()
driver.maximize_window()


df = pd.DataFrame(columns=["Title","Location","Company","Salary","Sponsored","Description"])

for i in range(25):
    driver.get('https://www.indeed.co.in/jobs?q=artificial%20intelligence&l=India&start='+str(i))
    jobs = []
    driver.implicitly_wait(20)
    

    for job in driver.find_elements_by_class_name('result'):

        soup = BeautifulSoup(job.get_attribute('innerHTML'),'html.parser')
        
        try:
            title = soup.find("a",class_="jobtitle").text.replace("\n","").strip()
            
        except:
            title = 'None'

        try:
            location = soup.find(class_="location").text
        except:
            location = 'None'

        try:
            company = soup.find(class_="company").text.replace("\n","").strip()
        except:
            company = 'None'

        try:
            salary = soup.find(class_="salary").text.replace("\n","").strip()
        except:
            salary = 'None'

        try:
            sponsored = soup.find(class_="sponsoredGray").text
            sponsored = "Sponsored"
        except:
            sponsored = "Organic"
                
        
sum_div = job.find_element_by_class_name('summary')

try:    
              sum_div.click()
except:
             close_button = driver.find_elements_by_class_name('popover-x-button-close')[0]
             close_button.click()
             sum_div.click()            
driver.implicitly_wait(2)
try:            
    job_desc = driver.find_element_by_css_selector('div#vjs-desc').text
    print(job_desc)
except:
    job_desc = 'None'   

df = df.append({'Title':title,'Location':location,"Company":company,"Salary":salary,
                        "Sponsored":sponsored,"Description":job_desc},ignore_index=True)


df.to_csv(r"C:\Users\Desktop\Python\Newtest.csv",index=False)
  • It seems to be an indentation issue. The code in my answer gave me a CSV file with 1931 lines. – Chris May 22 '21 at 15:22

1 Answer


It seems to be a simple indentation issue. Part of your code is running outside of the inner for loop: the summary click, the description lookup, and the df.append only execute once, after the loop has finished, using whatever values the loop variables held for the last job. That is why exactly one, seemingly random, row ends up in your CSV.
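To make the failure mode concrete, here is a minimal, self-contained sketch (toy data, no Selenium needed) of what a de-indented statement after a loop does:

jobs = ['job A', 'job B', 'job C']
rows = []

for job in jobs:
    title = job.upper()   # indented: runs once per job

rows.append(title)        # de-indented: runs only once, after the loop,
                          # with the value left over from the last job

print(rows)               # ['JOB C']  -> a single "random-looking" row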

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd
from bs4 import BeautifulSoup

options = Options()
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)


df = pd.DataFrame(columns=["Title","Location","Company","Salary","Sponsored","Description"])

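# NOTE: Indeed's start parameter pages in steps of 10, so range(0, 50, 10)
# requests the first five result pages (start = 0, 10, 20, 30, 40).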
for i in range(0,50,10):
    driver.get('https://www.indeed.co.in/jobs?q=artificial%20intelligence&l=India&start='+str(i))
    jobs = []
    driver.implicitly_wait(20)
    

    for job in driver.find_elements_by_class_name('result'):

        soup = BeautifulSoup(job.get_attribute('innerHTML'),'html.parser')
        
        try:
            title = soup.find("a",class_="jobtitle").text.replace("\n","").strip()
            
        except:
            title = 'None'

        try:
            location = soup.find(class_="location").text
        except:
            location = 'None'

        try:
            company = soup.find(class_="company").text.replace("\n","").strip()
        except:
            company = 'None'

        try:
            salary = soup.find(class_="salary").text.replace("\n","").strip()
        except:
            salary = 'None'

        try:
            sponsored = soup.find(class_="sponsoredGray").text
            sponsored = "Sponsored"
        except:
            sponsored = "Organic"


        sum_div = job.find_element_by_class_name('summary')

        try:
            sum_div.click()
        except:
            close_button = driver.find_elements_by_class_name('popover-x-button-close')[0]
            close_button.click()
            sum_div.click()
        driver.implicitly_wait(2)
        try:            
            job_desc = driver.find_element_by_css_selector('div#vjs-desc').text
            print(job_desc)
        except:
            job_desc = 'None'   

        df = df.append({'Title': title, 'Location': location, "Company": company, "Salary": salary,
                        "Sponsored": sponsored, "Description": job_desc}, ignore_index=True)

df.to_csv("test.csv",index=False)

I use Chrome instead of Firefox, but I don't think the issue was there; I just indented your code correctly.
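A side note on the pandas usage, not part of the indentation fix: calling df.append in a loop re-copies the whole frame on every iteration, and DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. A sketch of the usual alternative is to collect plain dicts and build the frame once:

rows = []

# inside the inner loop, instead of df = df.append(...):
rows.append({'Title': title, 'Location': location, 'Company': company,
             'Salary': salary, 'Sponsored': sponsored, 'Description': job_desc})

# after both loops finish:
df = pd.DataFrame(rows)
df.to_csv("test.csv", index=False)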

Also, it's not a good idea to write a bare except without naming the exception you expect. See: Why is "except: pass" a bad programming practice?
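For example, here is a minimal sketch of the title lookup with a named exception (when soup.find() matches nothing it returns None, so the .text access raises AttributeError):

try:
    title = soup.find("a", class_="jobtitle").text.replace("\n", "").strip()
except AttributeError:  # soup.find() returned None: no such element in this card
    title = 'None'

The Selenium calls can likewise catch selenium.common.exceptions.NoSuchElementException instead of using a bare except.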

Chris
  • Thanks for the help. I tried your code in Chrome and it works very well, but in Firefox the problem persists. Now it gives me "TabError: inconsistent use of tabs and spaces in indentation" at the try line: try: sum_div.click(). I kept changing the spaces, but in vain. – Darius Florea May 22 '21 at 16:42
  • That error means you are mixing indentation: 4 spaces in some places and a tab in others. If you go through your code and make the indentation consistent (for example, change every run of 4 spaces to a tab, or the other way around), it will resolve the error. – Chris May 22 '21 at 20:08
  • @DariusFlorea if this solved your issue or answered your question, please consider marking the answer as accepted (to keep up community maintenance). – Chris May 22 '21 at 20:10
  • I finally solved it. Thanks @Christopher Holder – Darius Florea May 26 '21 at 11:46