I am currently trying to scrape a job website (jobs.at). In the code below I look for the names of the job results and then save them in a list of dictionaries. The code below works for the first 15 results. The problem is that after every 15th search result, the website places an ad in between the job search results. The URL of the website is: https://www.jobs.at/j/personalverrechnung?dateFrom=all

The HTML code of the ad is the following:


<form method="POST" action="https://www.jobs.at/jobalarm" accept-charset="UTF-8" class="c-job-alarm-form j-c-card j-u-margin-bottom-xl j-u-overflow-hidden j-u-background-color-cyan-50" data-logged-in="false" novalidate data-form-name="job-alarm-form">…</form>

Can you think of any way to skip over this ad and collect all search results?

import requests
from bs4 import BeautifulSoup

# Setup is not shown in the question; this is the usual way to build the soup.
url = "https://www.jobs.at/j/personalverrechnung?dateFrom=all"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

jobs = []

for search_result in soup.find_all('div', class_="c-search-results"):
    # Collect the job headlines inside the current results container.
    for job_name in search_result.find_all("h2", class_="c-job-headline j-u-typo-m j-u-font-weight-bold j-u-margin-bottom-3xs"):
        try:
            job_name = job_name.a.text
        except Exception:
            job_name = None

        jobs.append({'job_name': job_name})

print(jobs)

CJSnoggle
  • It seems like if you are trying to get more than 15 results, the problem does not come from the ad. The page is just loading job offers in chunks of 15 each time you scroll to the bottom, so Beautiful Soup only gets the visible part of the page... you have to look at how to parse an infinite-scrolling page, good luck. – thesylio Feb 24 '21 at 16:23
  • Do you have any suggestions on how to do that? Any help is highly appreciated!! – CJSnoggle Feb 24 '21 at 20:14
  • Apparently Selenium might be a good solution: you'll have to scroll and get the results. I'm not familiar with this lib, but maybe this link can help: https://stackoverflow.com/questions/20986631/how-can-i-scroll-a-web-page-using-selenium-webdriver-in-python – thesylio Feb 25 '21 at 08:42 (a rough sketch of this idea follows below)
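
Following up on the comments above: if the 15-result limit really comes from the infinite scroll, a minimal sketch of that idea (not from the original posts, and untested against jobs.at) would be to load the page with Selenium, keep scrolling until no new results are appended, and then hand the rendered HTML to BeautifulSoup. The Chrome driver and the 2-second wait are assumptions you may need to adjust.

import time

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.jobs.at/j/personalverrechnung?dateFrom=all")

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    # Scroll to the bottom so the next chunk of 15 results gets loaded.
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # assumed delay; adjust if results load more slowly
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # nothing new was appended, so we reached the end
    last_height = new_height

soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

After this, the soup contains all loaded results and the parsing code from the question can run unchanged.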

1 Answer

The ads are under the class c-search-listing-item c-search-listing-item--jobalarm. You can remove these tags from the soup using the .extract() method:

for tag in soup.find_all(
    class_="c-search-listing-item c-search-listing-item--jobalarm"
):
    tag.extract()

# Continue code as normal
jobs = []

for search_result in soup.find_all("div", class_="c-search-results"):
    ...
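
For completeness, this is roughly what the continued code could look like once the ad cards are extracted (the class names are copied from the question; how soup is built is assumed to be unchanged):

jobs = []

for search_result in soup.find_all("div", class_="c-search-results"):
    for job_name in search_result.find_all(
        "h2",
        class_="c-job-headline j-u-typo-m j-u-font-weight-bold j-u-margin-bottom-3xs",
    ):
        # Guard against a card without a link instead of a bare try/except.
        name = job_name.a.text.strip() if job_name.a else None
        jobs.append({"job_name": name})

print(jobs)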
MendelG