It looks like the site is randomly serving two different pages with different markup and selectors, probably as part of an A/B testing scheme. If your .jobsearch-SerpJobCard selector fails to return anything, you know you're on site B, and you can use .jobCard-mainContent as a root node, then rewrite all of the child selectors. This is sort of a pain.
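If you did want to handle both versions, the branch might look something like the sketch below; the B side is a placeholder, since we haven't actually worked out its child selectors:

from bs4 import BeautifulSoup

def find_job_cards(soup: BeautifulSoup):
    """Return (variant, cards) for whichever markup the server sent."""
    cards = soup.select('.jobsearch-SerpJobCard')
    if cards:
        return 'A', cards

    # hypothetical fallback for site B; each child selector used during
    # scraping would also need a B-specific version
    return 'B', soup.select('.jobCard-mainContent')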
An easier way is to do something you should do anyway when making batches of requests to the same endpoint: open a session, which persists cookies and connection state between requests, generally improving speed and consistency.
# ... same code ...

with requests.Session() as session:
    for y in range(5):
        URL = f'https://www.indeed.com/jobs?q=software+engineer&start={rand}'
        print(URL)
        page = session.get(URL)  # changed from requests.get(URL)
        # ... same code, indented ...
Now, the A/B test should be consistent throughout the session.
The problem is that you can still be shuffled onto B for the entire session, in which case every page will give no results. The solution is to detect this and keep trying new sessions until you get results:
import requests
from bs4 import BeautifulSoup

def scrape():
    rand = 0
    count = 1

    with requests.Session() as session:
        for y in range(5):
            URL = f'https://www.indeed.com/jobs?q=software+engineer&start={rand}'
            print(URL)
            page = session.get(URL)  # changed from requests.get(URL)
            rand += 10
            soup = BeautifulSoup(page.content, 'html.parser')
            job_elems = soup.find_all('div', class_='jobsearch-SerpJobCard')

            if not job_elems:
                print("got 0 results, must have been served B; "
                      "try again from scratch...")
                return False  # we were served B, try a new session

            print(f"got {len(job_elems)} results; we're on A! Let's scrape!")
            # ... we were served A; do your scraping as before ...

    return True

while not scrape():
    pass
I think this is way better than writing all of the selector code twice. Here's a sample run:
https://www.indeed.com/jobs?q=software+engineer&start=0
got 0 results, must have been served B; try again from scratch...
https://www.indeed.com/jobs?q=software+engineer&start=0
got 0 results, must have been served B; try again from scratch...
https://www.indeed.com/jobs?q=software+engineer&start=0
got 15 results; we're on A! Let's scrape!
https://www.indeed.com/jobs?q=software+engineer&start=10
got 15 results; we're on A! Let's scrape!
https://www.indeed.com/jobs?q=software+engineer&start=20
got 15 results; we're on A! Let's scrape!
https://www.indeed.com/jobs?q=software+engineer&start=30
got 15 results; we're on A! Let's scrape!
https://www.indeed.com/jobs?q=software+engineer&start=40
got 15 results; we're on A! Let's scrape!
You can see that the first request served up the B markup, which we haven't bothered writing selectors for, so we start over with a fresh session. The second request got B again, so we start over once more. On the third request we got the correct markup, so we keep the session open and iterate through the pages, relying on the fact that the server knows who we are and should serve the same site throughout.
If this all seems brittle -- that's web scraping for you; websites can and will change at any time! If you can use an API or JSON endpoint, that's the way to go. I notice these listings are also available as a JS object (var jobmap = {};) which you can scrape out of the static HTML, but that comes with its own brand of brittleness. See Web-scraping JavaScript page with Python for generic information on alternative strategies.
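For what it's worth, extracting that object might look something like the sketch below. The jobmap[N] = {...}; assignment pattern is an assumption based on what the static HTML happened to contain, and the values are JS object literals rather than strict JSON, so they'd need further massaging before they're usable as data:

import re
import requests

page = requests.get('https://www.indeed.com/jobs?q=software+engineer&start=0')
page.raise_for_status()

# hypothetical pattern; matches assignments like jobmap[0] = {...};
entries = re.findall(r'jobmap\[\d+\]\s*=\s*(\{.*?\});', page.text)
print(f'found {len(entries)} jobmap entries')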
See Python download multiple files from links on pages if performance is important to you. Performance is probably the main reason you'd want to write selectors for both A and B, since that avoids the extra initial requests. Keep in mind there may be more than two markup types, though, and on average the retry approach probably only costs one extra request, assuming 50/50 odds of getting A or B.
As a final note, if you plan on running this scraper for an extended period, you might want to change the while loop to a for loop and raise an error if you can't get the right page after 20 or 30 tries; at that point, the website has probably changed permanently. response.raise_for_status() is also useful for getting a clear error message when your response is not OK.
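Here's a minimal sketch of that capped version, reusing the scrape() function from above (the cut-off of 30 is arbitrary):

for attempt in range(30):
    if scrape():
        break  # got the A markup; scraping succeeded
else:
    # the for loop exhausted every attempt without a break
    raise RuntimeError("no A markup after 30 tries; the site may have changed")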