
I'm trying to make a web scraper to scrape job posts off of Indeed. I don't understand why, on some iterations of the for loop, it only runs the requests portion and skips the rest of the code.

Example output (two different runs):

1)

Status Code

Status Code

Job Posts

Status Code

Job Posts

etc.

2)

Status Code

Status Code

Status Code

Status Code

Status Code

I want it to print the status code and the job posts every time instead of skipping.

import requests
from bs4 import BeautifulSoup

rand = 0
count = 1

for y in range(5):
    URL = f'https://www.indeed.com/jobs?q=software+engineer&start={rand}'
    print(URL)
    page = requests.get(URL)
    print(page)
    rand += 10

    soup = BeautifulSoup(page.content, 'html.parser')

    job_elems = soup.find_all('div', class_='jobsearch-SerpJobCard')

    for job_elem in job_elems:
        title = job_elem.find('h2', class_='title').a['title']
        company = job_elem.find('span', class_='company').text.strip()
        location = job_elem.find('div', class_='recJobLoc')['data-rc-loc']
        count += 1
        try:
            salary = job_elem.find('span', class_='salaryText').text.strip()
            print(salary)
        except:
            pass
        print(count)
        print(title)
        print(company)
        print(location)
        print()
  • Put `import traceback; traceback.print_exc()` in the `except` block and post the output. – theoctober19th May 22 '21 at 01:41
  • Traceback (most recent call last): File , line 24, in salary = job_elem.find('span', class_='salaryText').text.strip() AttributeError: 'NoneType' object has no attribute 'text' The reason why I have the try/except is to get the salary if the job posts provides it. If not then ignore. – DV123 May 22 '21 at 01:48
  • This try/except isn't the problem, the problem is your `div.jobsearch-SerpJobCard` selector is occasionally not on the page. – ggorlen May 22 '21 at 01:53
  • @ggorlen Ohh, I see hmm. – DV123 May 22 '21 at 01:53
  • @ggorlen Do you have any idea why sometimes the selector isn't there? I've tried to manually look on each page and it's there each time for me. – DV123 May 22 '21 at 02:04
  • Seems like they might have an a/b thing of some sort going on. I'm seeing at least two different documents being served each with its own CSS. – ggorlen May 22 '21 at 02:06
  • @ggorlen What does a/b thing mean? Sry, big noob. – DV123 May 22 '21 at 02:08
  • It means they're randomly serving one of two sites, either A or B. Websites do this all of the time, design two things, serve one or the other and see which performs better. – ggorlen May 22 '21 at 02:19
  • @ggorlen Thank you for the info. I was wondering, lets say on site A I get the info I want. Is it a possible solution to keep requesting a certain page until I land on site A instead of B? – DV123 May 22 '21 at 02:39
  • Did you see my answer? – ggorlen May 22 '21 at 02:41
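
For reference, a minimal sketch of the debugging step suggested in the comments, assuming the failure is find() returning None when a post has no salary element:

# inside the inner job_elem loop, replacing the bare except
import traceback

try:
    salary = job_elem.find('span', class_='salaryText').text.strip()
    print(salary)
except AttributeError:
    # find() returned None, so .text raised; print the traceback
    # while debugging instead of silently swallowing the error
    traceback.print_exc()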

1 Answer


Looks like the site is randomly serving two different pages with different markup and selectors, probably as part of an A/B testing scheme.

If your .jobsearch-SerpJobCard selector fails to return anything, you know you're on site B, and you can use .jobCard-mainContent as a root node, then change all of the child selectors, roughly as sketched below. This is sort of a pain.
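
The shape would be something like this (a sketch only; the site-B child selectors are hypothetical placeholders, so you'd need to inspect the B markup for the real ones):

# ... after soup = BeautifulSoup(page.content, 'html.parser') ...
job_elems = soup.find_all('div', class_='jobsearch-SerpJobCard')

if job_elems:
    # site A markup
    for job_elem in job_elems:
        title = job_elem.find('h2', class_='title').a['title']
        # ... rest of the A selectors ...
else:
    # site B markup; these child selectors are guesses
    for job_elem in soup.find_all(class_='jobCard-mainContent'):
        title = job_elem.find('h2').get_text(strip=True)
        # ... rest of the B selectors ...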

An easier way is to do something you should do anyway when making batches of requests to the same endpoint: open a session, which persists cookies and connection state between requests, generally improving both speed and consistency.

# ... same code ...

with requests.Session() as session:
    for y in range(5):
        URL = f'https://www.indeed.com/jobs?q=software+engineer&start={rand}'
        print(URL)
        page = session.get(URL) # changed from requests.get(URL)

# ... same code, indented ...

Now, the A/B test should be consistent throughout the session.

The problem is, you can still get shuffled onto B for the entire session, in which case all pages will give no results. The solution is to detect this in a loop and keep trying new sessions until you do get results:

import requests
from bs4 import BeautifulSoup

def scrape():    
    rand = 0
    count = 1

    with requests.Session() as session:
        for y in range(5):
            URL = f'https://www.indeed.com/jobs?q=software+engineer&start={rand}'
            print(URL)
            page = session.get(URL) # changed from requests.get(URL)
            rand += 10
            soup = BeautifulSoup(page.content, 'html.parser')
            job_elems = soup.find_all('div', class_='jobsearch-SerpJobCard')

            if not job_elems:
                print(f"got 0 results, must have been served B;"
                      " try again from scratch...")
                return False # we were served B, try a new session
            
            print(f"got {len(job_elems)} results; we're on A! Let's scrape!")
            # ... we were served A; do your scraping as before ...

    return True

while not scrape():
    pass

I think this is way better than writing all of the selector code twice. Here's a sample run:

https://www.indeed.com/jobs?q=software+engineer&start=0
got 0 results, must have been served B; try again from scratch...
https://www.indeed.com/jobs?q=software+engineer&start=0
got 0 results, must have been served B; try again from scratch...
https://www.indeed.com/jobs?q=software+engineer&start=0
got 15 results; we're on A! Let's scrape!
https://www.indeed.com/jobs?q=software+engineer&start=10
got 15 results; we're on A! Let's scrape!
https://www.indeed.com/jobs?q=software+engineer&start=20
got 15 results; we're on A! Let's scrape!
https://www.indeed.com/jobs?q=software+engineer&start=30
got 15 results; we're on A! Let's scrape!
https://www.indeed.com/jobs?q=software+engineer&start=40
got 15 results; we're on A! Let's scrape!

You can see the first request served up the B markup that we haven't bothered writing selectors for, so we try again with a fresh session. On the second request, we got B again, so we try again with a fresh session. On the third request, we got the correct markup so we keep the session open and iterate through the pages relying on the fact that the server knows who we are and should serve the same site throughout.

If this all seems brittle -- that's web scraping for you; websites can and will change at any time! If you can use an API or JSON endpoint, that's the way to go. I notice these listings are also available as a JS object var jobmap = {}; which you can scrape out of the static HTML, but that comes with its own brand of brittleness. See Web-scraping JavaScript page with Python for generic information on alternative strategies.
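
If you wanted to try that jobmap route, here's a rough sketch. The assignment pattern in the regex is an assumption about the page source, so inspect the HTML you actually receive and adjust:

import re
import requests

html = requests.get('https://www.indeed.com/jobs?q=software+engineer&start=0').text

# assumes the inline script assigns entries like jobmap[0] = {...};
# adjust the pattern to whatever the served page actually contains
for m in re.finditer(r'jobmap\[\d+\]\s*=\s*(\{.*?\});', html, re.DOTALL):
    print(m.group(1))  # raw JS object literal; may not be strict JSON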

See Python download multiple files from links on pages if performance matters. Performance is probably the main reason you'd write selectors for both A and B, since that avoids the extra initial requests; keep in mind there may be more than two markup variants, though, and on average the retry approach probably only costs one extra request, assuming a 50/50 chance of getting A or B.

As a final note, if you plan on running this scraper for an extended period, you might want to change the while loop to a for loop and raise an error if you can't get the right page after 20 or 30 tries; it probably means the website changed permanently. response.raise_for_status() is also useful for getting a clear error message when your response is not OK.
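
A minimal sketch of that bounded version, reusing scrape() from above (the cap of 30 is arbitrary):

MAX_TRIES = 30

for attempt in range(MAX_TRIES):
    if scrape():
        break
else:  # the for/else branch runs only if we never hit break
    raise RuntimeError(f"no results after {MAX_TRIES} sessions; "
                       "the markup has probably changed for good")

Inside scrape(), calling page.raise_for_status() right after session.get(URL) turns a 4xx/5xx response into an immediate, readable exception instead of an empty result set.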

  • I'm not that familiar with sessions, but to my understanding: we create a session, and through that session it will be either A or B. If it's A, we're all good and we get the data we want. If it's B, we try to create new sessions until it's A. – DV123 May 22 '21 at 02:48
  • You got it. Try running the example a few times. – ggorlen May 22 '21 at 02:50
  • No problem, thanks for the interesting question with code that shows a clear, reproducible problem. – ggorlen May 22 '21 at 03:00
  • Depending on why they're doing A/B, it could also go away quite quickly - eg if it's a new version rollout rather than user response tweaking. – Jiří Baum May 22 '21 at 03:49
  • Yep, the site can change for any reason at any time and break everything, "If this all seems brittle...". Hence, slap a session and a loop on it and scrape away until more information arises: sessions are good anyway and the loop can't hurt even if you code up selectors for B in case a page C arises. I'm not sure if OP plans this to be a one-off scrape or long-running scheduled task but if this were long-running, I'd have it log and email me if it stops working, then tweak it at that time instead of trying to guess what will happen. – ggorlen May 22 '21 at 04:22