When using Selenium to retrieve data from indeed.com, I have noticed a strange behaviour that I still cannot explain.

Introduction

The URLs have the following format:

https://it.indeed.com/jobs?q=Call%20Center&sort=date&start=10&vjk=f19d9d0e655cfd87

Which is

https://it.indeed.com/jobs
    ?q=[ QUERY ]
    &sort=[ SORT CRITERIA ]
    &start=[ FIRST RESULT* ]
    &vjk=[ ID OF JOB HIGHLIGHTED ] # Can be omitted 

# First result: if this value is 10, the page will show the results starting from the 10th object
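
The query-string layout above can also be assembled programmatically. The following is a minimal sketch, assuming a hypothetical helper name `build_indeed_url` (it is not part of the original code) and the parameter meanings described above:

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://it.indeed.com/jobs"


def build_indeed_url(query, start=0, sort="date", vjk=None):
    """Build a results-page URL; `vjk` (id of the highlighted job) is optional."""
    params = {"q": query, "sort": sort, "start": start}
    if vjk is not None:
        params["vjk"] = vjk
    # quote_via=quote encodes spaces as %20 rather than '+'
    return f"{BASE_URL}?{urlencode(params, quote_via=quote)}"


print(build_indeed_url("Call Center", start=10))
```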

You can browse different pages of results.

When you open a second page (it does not matter where you start from), a popup appears, inviting you to subscribe to the newsletter.

The popup

The strange, unexpected behaviour

Whenever I scrape my second page, and the popup appears, I can only scrape a portion of the results.

My first assumption:

  • Maybe part of the results is hidden / not loaded when the popup appears. But looking at the page with Chrome, I could confirm that the page contains all of the data I need.

Second assumption:

  • The page needs more time to load. But increasing the waiting time did not solve the issue.

I do not understand what is happening, could you help?

My Code

# Get the driver
driver = get_driver("YOURPATHTO/chromedriver")
driver.implicitly_wait(5)

url_indeed = lambda x: f"https://it.indeed.com/jobs?q=Call%20Center&sort=date&start={x}"

list_jobs = []

# Let's get the first 10 pages
for i in range(0, 1000, 10):
    current_jobs = []
 
    # Get the page
    driver.get(url_indeed(i))

    # Parse the single jobs / the single results
    jobs = driver.find_elements_by_xpath("//div[contains(@class, 'result') and contains(@class, 'job_')]")

    for counter in range(len(jobs)):
        job = jobs[counter]
        dictio = {}
        print("___ ___ ___")
        print(job.text) # Debug
        search1 = job.find_elements_by_xpath(".//div[contains(@class, 'topLeft')]")
        search2 = job.find_elements_by_xpath(".//span[contains(@id,'jobTitle')]")
        search3 = job.find_elements_by_xpath(".//span[@class='companyName']")
        search4 = job.find_elements_by_xpath(".//div[@class='companyLocation']")

        dictio["extra"] = search1[0].text
        dictio["work"] = search2[0].text
        dictio["company"] = search3[0].text
        dictio["place"] = search4[0].text

        if dictio["company"] == '':
            print(":(") # Debug
            pass
        
        current_jobs.append(dictio)

    print(current_jobs)
    print(len(current_jobs))
    list_jobs.extend(current_jobs)

My output (at the second iteration of the loop)

current_jobs

The expected output...

There should be no results missing like this. It is almost like there is the expected HTML but with no text inside of it.

No jobs should be missing

EDIT

FYI I have tried already to close the popup, but it does not solve the issue. Also, you can manually verify that the html document is not lacking anything, even while the popup is present.

Federico Dorato
  • I can reproduce your issue manually and can post you a response, but is there more to your code? You have `for counter in range(len(jobs)):` but `jobs` is undefined. I'm expecting a line before that, which I assume counts the number of jobs on the page? – RichEdwards Jun 16 '22 at 10:45
  • Probably ignore that last comment - I've added something in an answer that seems to work. Ask if you have questions – RichEdwards Jun 16 '22 at 11:28
  • @RichEdwards you are right, I've accidentally removed one important row, my apologies – Federico Dorato Jun 16 '22 at 13:31

2 Answers

2

I got it working with a few changes to the code.

I've also included a couple of changes you didn't ask for.

1/ From this:

# Let's get the first 10 pages
for i in range(0, 1000, 10):

To this:

pageStart = 0 # the starting page
pageRange = 3 # num of pages. 5 == first 5 pages - lowered to debug
pageSteps = 10 # num of results to increment
for i in range(0, (pageRange * pageSteps), pageSteps):

There's a concept called "magic numbers": hard-coded values whose meaning isn't obvious. Sometimes they're fine, but in this instance, when I ran your code it was doing 100 pages (not 10), because range(0, 1000, 10) yields 100 offsets. A calculated value makes it easier to see and control, with less maintenance. I also reuse pageSteps, which saves another magic number further in the code.
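
As a rough illustration of the calculated range above, the offsets can be produced by a small generator. This is only a sketch: `page_offsets` is a hypothetical helper, and it also puts the otherwise unused `pageStart` to work by assuming it counts pages from zero.

```python
def page_offsets(page_start, page_range, page_steps):
    """Yield the `start` offset for each results page to visit."""
    first = page_start * page_steps
    for page in range(page_range):
        yield first + page * page_steps


# Three pages of ten results each, starting from the first page
print(list(page_offsets(page_start=0, page_range=3, page_steps=10)))  # [0, 10, 20]
```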

2/ `jobs` is undefined in the code you provided, so I've added this in the middle:

    # I've added this line - i assume this is what you're doing?
    jobs = driver.find_elements(By.XPATH, "//div[@class='job_seen_beacon']")

3/ Even though the Selenium docs advise against it, I modified the solution to use both kinds of wait (explicit and implicit) with the same timeout:

waitTime = 10
driver.implicitly_wait(waitTime)
## Selenium docs say not to mix both waits as you CAN get unpredictable waits,
## but using the same timeout for both is OK
wait = WebDriverWait(driver, waitTime) 

The docs advise against mixing them because it can result in unpredictable wait times. Using the same timeout for both, as above, mitigates that. This code still runs fast, and when everything loads as expected there's no issue.
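
For intuition about what the explicit wait contributes here, the sketch below is a simplified polling loop in the spirit of `WebDriverWait.until`. It is not Selenium's implementation; `wait_until` and its parameters are hypothetical names used only for illustration.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition` repeatedly until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Example: a condition that only becomes true on the third check
attempts = []


def ready():
    attempts.append(1)
    return len(attempts) >= 3


wait_until(ready, timeout=2.0, poll=0.01)
print(len(attempts))  # 3
```

An implicit wait, by contrast, makes each element lookup itself poll for up to the configured time, which is why combining the two can stretch the total time waited.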

4/ Then we have the solution to your question.

I have to close both the Google popup and the newsletter popup. You might not need the Google line: my screen resolution is smaller than yours, and if I don't close the Google popup I get click-intercepted errors. This is how my machine looks: [screenshot]

This is how it's handled:

    # On the second page, i.e. when i has increased by one increment.
    # This runs the first time the page is incremented;
    # it only needs to be done once.
    if i==pageSteps:
        driver.find_element(By.XPATH, '//div[@class="google-Only-Modal-Upper-Row"]//button[@aria-label="Close"]').click()
        closeButtonXpath = "//div[@id='popover-x']/button"
        wait.until(EC.element_to_be_clickable((By.XPATH, closeButtonXpath))).click()
        wait.until(EC.invisibility_of_element((By.XPATH, closeButtonXpath)))

It only needs to run once, on the second page (when i has been incremented by pageSteps).
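
Since the Google popup may not show up on every machine, the first click can be made optional. Below is a minimal sketch, assuming a hypothetical helper `click_if_present`: it relies on `find_elements` (plural), which returns an empty list instead of raising when nothing matches. The stand-in classes exist only so the sketch runs without a browser.

```python
def click_if_present(driver, by, value):
    """Click the first element matching the locator; return True if one was found."""
    matches = driver.find_elements(by, value)
    if not matches:
        return False
    matches[0].click()
    return True


# Stand-ins for a real driver and element, for illustration only
class FakeElement:
    def __init__(self):
        self.clicked = False

    def click(self):
        self.clicked = True


class FakeDriver:
    def __init__(self, elements):
        self._elements = elements

    def find_elements(self, by, value):
        return self._elements


button = FakeElement()
print(click_if_present(FakeDriver([button]), "xpath", "//button"))  # True
print(click_if_present(FakeDriver([]), "xpath", "//button"))        # False
```

With a real driver the call would take the same locator as the Google close button above, e.g. `click_if_present(driver, By.XPATH, ...)`. Note that with an implicit wait set, a lookup that finds nothing still waits out the full timeout before returning.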

#######################################

This is everything put together:


from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By


# Get the driver
driver = webdriver.Chrome() # note i modified this to my driver
waitTime = 10
driver.implicitly_wait(waitTime)
## Selenium docs say not to mix both waits as you CAN get unpredictable waits,
## but using the same timeout for both is OK
wait = WebDriverWait(driver, waitTime) 

url_indeed = lambda x: f"https://it.indeed.com/jobs?q=Call%20Center&sort=date&start={x}"

list_jobs = []

pageStart = 0 # the starting page
pageRange = 3 # num of pages. 5 == first 5 pages - lowered to debug
pageSteps = 10 # num of results to increment
for i in range(0, (pageRange * pageSteps), pageSteps):
    current_jobs = []
 
    # Get the page
    driver.get(url_indeed(i))


    # On the second page, i.e. when i has increased by one increment.
    # This runs the first time the page is incremented;
    # it only needs to be done once.
    if i==pageSteps:
        driver.find_element(By.XPATH, '//div[@class="google-Only-Modal-Upper-Row"]//button[@aria-label="Close"]').click()
        closeButtonXpath = "//div[@id='popover-x']/button"
        wait.until(EC.element_to_be_clickable((By.XPATH, closeButtonXpath))).click()
        wait.until(EC.invisibility_of_element((By.XPATH, closeButtonXpath)))

    # I've added this line - i assume this is what you're doing?
    jobs = driver.find_elements(By.XPATH, "//div[@class='job_seen_beacon']")

    for counter in range(len(jobs)):
        job = jobs[counter]
        dictio = {}
        #print("___ ___ ___")
        #print(job.text) # Debug  - removed to create a clean output for the response
        # find_elements(By.XPATH, ...) replaces the find_elements_by_xpath helpers removed in Selenium 4.3
        search1 = job.find_elements(By.XPATH, ".//div[contains(@class, 'topLeft')]")
        search2 = job.find_elements(By.XPATH, ".//span[contains(@id,'jobTitle')]")
        search3 = job.find_elements(By.XPATH, ".//span[@class='companyName']")
        search4 = job.find_elements(By.XPATH, ".//div[@class='companyLocation']")

        dictio["extra"] = search1[0].text
        dictio["work"] = search2[0].text
        dictio["company"] = search3[0].text
        dictio["place"] = search4[0].text

        if dictio["company"] == '':
            print(":(") # Debug
            pass
        
        current_jobs.append(dictio) 

    print(current_jobs)
    print(len(current_jobs))
    list_jobs.extend(current_jobs)

This is the output:

[{'extra': 'nuova offerta', 'work': 'Operatori Call Center Inbound e Outbound', 'company': 'PrestitoSì Finance', 'place': 'Milano, Lombardia'}, {'extra': 'nuova offerta', 'work': 'Operatore outbound Smart Working', 'company': 'Elite', 'place': 'Da remoto in 81030 Teverola'}, {'extra': 'nuova offerta', 'work': 'Operatore Call Center PART-TIME - 500 euro Mensili', 'company': 'GE.SAR', 'place': '81100 Caserta'}, {'extra': 'nuova offerta', 'work': 'Operatore Call Center FULL-TIME - 800 euro Mensili', 'company': 'GE.SAR', 'place': '81100 Caserta'}, {'extra': 'nuova offerta', 'work': 'ASSISTENZA CLIENTI', 'company': 'Rizdan Job', 'place': '81100 Caserta'}, {'extra': 'nuova offerta', 'work': 'MANAGER DI CALL CENTER FIRENZE', 'company': 'R1S S.r.l.', 'place': 'Firenze Centro, Toscana'}, {'extra': 'nuova offerta', 'work': 'Operatore telefonico', 'company': 'Refcons', 'place': 'Orta di Atella, Campania\n+2 luoghi'}, 
{'extra': 'nuova offerta', 'work': 'TEAM LEADER - RESPONSABILE CALL CENTER', 'company': 'CHRIMAR SRLS', 'place': 'Parma, Emilia-Romagna'}, {'extra': 'nuova offerta', 'work': 'Cerchiamo una venditrice di spazi pubblicitari', 'company': 'Lime Edizioni srl Milano', 'place': 'Corbetta, Lombardia'}, {'extra': 'nuova offerta', 'work': 'call center outbound', 'company': 'Jonio Comunicazioni S.r.l.', 'place': "Da remoto in 95131 Sant'Agata li Battiati"}, {'extra': 'nuova offerta', 'work': 'Operatore telefonico', 'company': '24 MAGGIO TELEFONIA', 'place': '80016 Marano di Napoli'}, {'extra': 'nuova offerta', 'work': 'Operatore Back Office- ROMA SUD', 'company': 'SAGRES SRL', 'place': '00144 Roma'}, {'extra': 'nuova offerta', 'work': 'Apprendista commesso', 'company': 'Tommi srl', 'place': 'Empoli, Toscana\n+2 luoghi'}, {'extra': 'nuova offerta', 'work': 'Architetto', 'company': 'FACILE RISTRUTTURARE S.p.a.', 'place': 'Milano, Lombardia'}, {'extra': 'nuova offerta', 'work': 'Addetta/o call center outbound', 'company': 'Fidani S.r.l', 'place': '00142 Roma'}]
15
[{'extra': 'nuova offerta', 'work': 'Operatore telemarketing - No vendita', 'company': 'AVC Utility Services', 'place': '21047 Saronno'}, {'extra': 'nuova offerta', 'work': 'TEAM LEADER CALL CENTER', 'company': 'ServiceHub srls', 'place': '80019 Qualiano\n+1 luogo'}, {'extra': 'nuova offerta', 'work': 'Capo Cantiere Automazione Industriale (018367)', 'company': 'Hunters Group S.r.l.', 'place': 'Rimini, Emilia-Romagna'}, {'extra': 'nuova offerta', 'work': 'IMPIEGATO COMMERCIALE VENDITA TELEFONICA OPERATORE OUTBOUND', 'company': 'MEDIAFIVE SRL', 'place': '10141 Torino'}, {'extra': 'nuova offerta', 'work': 'SEGRETARIA/CALL CENTER CATEGORIE PROTETTE', 'company': 'ETICA LAVORO Srl', 'place': 'Roma, Lazio'}, {'extra': 'nuova offerta', 'work': 'OPERATORE CUSTOMER CARE SETTORE TELEMATICO-ASSICURATIVO LING...', 'company': 'Randstad Italia', 'place': 'Roma, Lazio'}, {'extra': 'nuova offerta', 'work': 'Operatore telefonico Web sales/Upsales', 'company': 'H2Com', 'place': 'Da remoto in 00164 Roma'}, {'extra': 'nuova offerta', 'work': 'Operatore Telefonico Web Sales settore Telecomunicazioni', 'company': 'H2Com', 'place': 'Da remoto in 00164 Roma'}, {'extra': 'nuova offerta', 'work': 'Operatore telefonico ramo aziende', 'company': 'H2Com', 'place': 'Da remoto in 00164 Roma'}, {'extra': 'nuova offerta', 'work': 'Sales Account - Inbound / Outband | Commodities No Food', 'company': 'Page Personnel Italia', 'place': 'Lecco, Lombardia'}, {'extra': 'nuova offerta', 'work': 'Operatrice di call center', 'company': 'associazione JKT', 'place': '20153 Milano'}, {'extra': 'nuova offerta', 'work': 'IMPIEGATO ADDETTO AL TELEMARKETING', 'company': 'LinkLab srl', 'place': 'Trento, Trentino-Alto Adige'}, {'extra': 'nuova offerta', 'work': 'operatore call center inbound', 'company': 'Randstad', 'place': 'Da remoto in Rende, Calabria\n+1 luogo'}, {'extra': 'nuova offerta', 'work': 'OPERATORE TELEFONICO_INSERIMENTO IMMEDIATO', 'company': 'Mercurycall', 'place': 'Andria, Puglia'}, {'extra': 'nuova 
offerta', 'work': 'Operatore Call Center Part Time', 'company': 'We Can Consulting', 'place': '90135 Palermo'}]
15
[{'extra': 'nuova offerta', 'work': 'OPERATORE ASSISTENZA CLIENTI TELEFONICA - INBOUND', 'company': 'Etjca S.p.a.', 'place': 'Rende, Calabria'}, {'extra': 'nuova offerta', 'work': 'Addetto/a al Customer service', 'company': 'Adecco Italia', 'place': 'Lainate, Lombardia'}, {'extra': 'nuova offerta', 'work': 'Stage Customer Service', 'company': 'Adecco Italia', 'place': 'Segrate, Lombardia\n+1 luogo'}, {'extra': 'nuova offerta', 'work': 'Receptionist - lingua tedesca', 'company': 'Adecco Italia', 'place': 'Chioggia, Veneto'}, {'extra': 'nuova offerta', 'work': 'Accettatore clienti in officina', 'company': 'Adecco Italia', 'place': 'Torino, Piemonte'}, {'extra': 'nuova offerta', 'work': 'OPERATORE CUSTOMER SERVICE', 'company': 'OpenjobMetis', 'place': 'Catanzaro, Calabria\n+5 luoghi'}, {'extra': 'nuova offerta', 'work': 'Back Office - L.68/99 Lucca', 'company': 'Adecco Italia', 'place': 'Lucca, Toscana\n+1 luogo'}, {'extra': 'nuova offerta', 'work': 'addetto/a call center - categoria protetta legge 68/99', 'company': 'Randstad', 'place': 'Calderara di Reno, Emilia-Romagna'}, {'extra': 'nuova offerta', 'work': 'Operatore call center outbound', 'company': 'GIERRE CONTACT', 'place': 'Da remoto in Brindisi, Puglia\n+1 luogo'}, {'extra': 'nuova offerta', 'work': 'Smartworking Call Center Teleselling', 'company': 'Apophis s.r.l.', 'place': 'Da remoto in Roma, Lazio'}, {'extra': 'nuova offerta', 'work': 'Operatore telefonico da remoto', 'company': 'SGM DISTRIBUZIONE SRL', 'place': 'Da remoto in Torino, Piemonte'}, {'extra': 'nuova offerta', 'work': 'CALL CENTER MANAGER PROVINCIA DI MILANO', 'company': 'R1S S.r.l.', 'place': 'Cinisello Balsamo, Lombardia'}, {'extra': 'nuova offerta', 'work': 'operatore call center inbound e gestione canali digital', 'company': 'Msc srl', 'place': 'Reggio Emilia, Emilia-Romagna'}, {'extra': 'nuova offerta', 'work': 'Operatori Call Center Sede Triggiano e Sede Bari fisso 650', 'company': 'Apophis s.r.l.', 'place': 'Bari, Puglia'}, {'extra': 
'nuova offerta', 'work': 'OPERATORE CALL CENTER PART-TIME', 'company': 'L&C', 'place': 'Roma, Lazio'}]
15
RichEdwards
  • Wow. Thanks a lot for the big effort! I will test this code instantly. I am confused tho... I have tried previously to close the popup, but I was not using any kind of explicit wait. If your code works, does it mean that the popup is effectively hiding the text inside of some html tags? – Federico Dorato Jun 16 '22 at 14:34
  • It is a bit odd - normally you can still get data from behind these sorts of popups. To be honest I didn't look into it! I've seen similar types of blocking modal dialog before. I was able to diagnose it and solve it quickly by debugging the code. Once I saw it was just a synchronisation issue I put the second wait in (the one for `EC.invisibility_of_element`) and the problem went away :-) – RichEdwards Jun 16 '22 at 15:55
0

Selenium allows you to click on elements. It could be as easy as:

# requires: from selenium.webdriver.common.by import By
driver.find_element(By.CSS_SELECTOR, '#close-button').click()

Every time you continue to the next page, check whether the popup appears. A plain if-statement is not enough, because Selenium raises an error when the element is not found, so wrap the lookup in a try/except block. If the popup is there, find the close-button element and click on it, wait a little, and continue with the rest of the program.
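
A minimal sketch of that pattern, written so it runs without a browser: `close_popup_if_present` is a hypothetical helper that takes the lookup as a callable plus the exception types to treat as "popup not there". With Selenium you would presumably pass `NoSuchElementException` (from `selenium.common.exceptions`); here `LookupError` stands in for it.

```python
def close_popup_if_present(find_close_button, not_found_errors):
    """Try to locate and click the close button; return True if it was clicked."""
    try:
        button = find_close_button()
    except not_found_errors:
        # The popup did not appear on this page, so there is nothing to close
        return False
    button.click()
    return True


# Stand-ins for illustration only
class FakeButton:
    def __init__(self):
        self.clicks = 0

    def click(self):
        self.clicks += 1


def popup_missing():
    raise LookupError("no close button on this page")


button = FakeButton()
print(close_popup_if_present(lambda: button, (LookupError,)))  # True
print(close_popup_if_present(popup_missing, (LookupError,)))   # False
```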

Update: Added RichEdwards's comment to the explanation.

Impaex
  • An `if` statement won't work: Selenium throws an error if the element doesn't appear, so you need a `try/catch` block to handle negative element validation ;-) – RichEdwards Jun 16 '22 at 10:32
  • @RichEdwards Good addition, I have updated the answer to include it, thank you! – Impaex Jun 16 '22 at 10:37
  • @Impaex This does not answer my question and it does not solve my issue. I have tried to close the popup, but it does not have any kind of impact on the rest of the page. Manually, you can see that even while the popup is present, the rest of the html document is not lacking anything. Closing the popup does not change the rest of the document. – Federico Dorato Jun 16 '22 at 13:33