The page https://www.indeed.com/jobs is protected by Cloudflare.
import requests

# Search parameters sent as the query string
params = {
    'q': 'motorcycle mechanic',
    'l': 'New York, NY',
}
# A browser-like User-Agent; this alone does not get past the challenge
http_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
}
response = requests.get(
    'https://www.indeed.com/jobs',
    headers=http_headers,
    params=params,
    allow_redirects=True,
    verify=True,
    timeout=30,
)
Output of print(response.headers). Note the 'Server': 'cloudflare' entry near the end:
{'Date': 'Sat, 01 Apr 2023 18:42:55 GMT', 'Content-Type': 'text/html; charset=UTF-8',
'Transfer-Encoding': 'chunked', 'Connection': 'close', 'Cross-Origin-Embedder-Policy':
'require-corp', 'Cross-Origin-Opener-Policy': 'same-origin', 'Cross-Origin-Resource-Policy': 'same-origin', 'Permissions-Policy':
'accelerometer=(),autoplay=(),camera=(),clipboard-read=(),clipboard-write=(),fullscreen=(),
geolocation=(),gyroscope=(),hid=(),interest-cohort=(),magnetometer=(),microphone=(),
payment=(),publickey-credentials-get=(),screen-wake-lock=(),serial=(),sync-xhr=(),
usb=()', 'Referrer-Policy': 'same-origin', 'X-Frame-Options': 'SAMEORIGIN', 'Cache-Control':
'private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0',
'Expires': 'Thu, 01 Jan 1970 00:00:01 GMT', 'Set-Cookie': '__cf_bm=afTsbfjJKeoN7yqVDI7bGjYTFhaF_QEDC9mCtkjT1Js-1680374575-0-AQJ5H4x6T28fONNVrM8Fh2nYeq6G8RB3+L/vxbSJwWTzIjPb0CeR/HO1AsKx9GRj6dLZz+ZHZ/Oc8om0NMQ+/YM=;
path=/; expires=Sat, 01-Apr-23 19:12:55 GMT; domain=.indeed.com; HttpOnly; Secure;
SameSite=None', 'Vary': 'Accept-Encoding', 'Server': 'cloudflare',
'CF-RAY': '7b12f98b0847adb9-ATL', 'Content-Encoding': 'br', 'alt-svc': 'h3=":443";
ma=86400, h3-29=":443"; ma=86400'}
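If you want to check for this programmatically instead of reading the headers by eye, here is a minimal sketch. The function name `served_by_cloudflare` is my own, and treating the `Server` and `CF-RAY` headers as the signal is an assumption based only on the output above. It tells you Cloudflare is in front of the site, not that you were actually challenged.

```python
def served_by_cloudflare(headers):
    """Return True if the response headers suggest Cloudflare fronts the site."""
    # Header names are case-insensitive, so normalize them first.
    # This also makes the helper work with a plain dict.
    normalized = {name.lower(): value for name, value in headers.items()}
    server = normalized.get("server", "").lower()
    return server == "cloudflare" or "cf-ray" in normalized


# Values copied from the header dump above.
sample_headers = {"Server": "cloudflare", "CF-RAY": "7b12f98b0847adb9-ATL"}
print(served_by_cloudflare(sample_headers))
```

With a live response you would call `served_by_cloudflare(response.headers)`.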
Snippets from the output of print(response.text). These snippets indicate that the page is returning a Cloudflare challenge to your Python request:
<span id="challenge-error-text">
Enable JavaScript and cookies to continue
</span>
trkjs.setAttribute('src', '/cdn-cgi/images/trace/managed/js/transparent.gif?ray=7b130075eea3ad6b');
cpo.src = '/cdn-cgi/challenge-platform/h/b/orchestrate/managed/v1?ray=7b130075eea3ad6b';
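The same check can be done in code. This is only a sketch: the helper name `looks_like_challenge` and the marker list are my own, taken from the snippets above, and Cloudflare may change this markup at any time.

```python
# Substrings seen in the challenge page shown above (an assumption, not a stable API).
CHALLENGE_MARKERS = (
    'id="challenge-error-text"',
    "/cdn-cgi/challenge-platform/",
)


def looks_like_challenge(html):
    """Return True if the HTML contains any known challenge marker."""
    return any(marker in html for marker in CHALLENGE_MARKERS)


sample_html = '<span id="challenge-error-text">Enable JavaScript and cookies to continue</span>'
print(looks_like_challenge(sample_html))
```

With a live response you would call `looks_like_challenge(response.text)` before trying to parse any job listings.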
I would recommend using cloudscraper to scrape the site. I don't want to post the exact code that I used to bypass the Cloudflare protection for indeed.com, so the snippet below only shows how I parsed the page once I had it.
# .bypass() is a function based on the link I provided.
soup = Cloudflare('https://www.indeed.com/jobs').bypass()
table_results = soup.find_all('td', {'class': 'resultContent'})
for item in table_results:
    link = item.find('span')
    print(link.attrs)
# {'title': 'Auto Mechanic (Diesel)', 'id': 'jobTitle-9d7ba98aa6ce1036'}
# {'title': 'Motorcycle Mechanic A,B OR C', 'id': 'jobTitle-a91f7c5e2d1c0a53'}
# {'title': 'NEW VEHICLE SET UP MECHANIC', 'id': 'jobTitle-cbe3a30bbf3e415d'}
# {'title': 'Motorcycle Mechanic', 'id': 'jobTitle-8736df00befc62ab'}
# {'title': 'Mechanic', 'id': 'jobTitle-cf8a92124f5fe421'}
This site provides the basic details on how to use cloudscraper, which will allow you to bypass the Cloudflare protection.
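For reference, basic cloudscraper usage looks roughly like the sketch below. This is not the code I used for indeed.com, and I can't promise it gets past the challenge there. `build_params` and `fetch_jobs_html` are hypothetical helper names. The only cloudscraper call is `create_scraper()`, which returns a requests-style session.

```python
def build_params(query, location):
    """Build the query-string parameters for the job search page."""
    return {"q": query, "l": location}


def fetch_jobs_html(query, location, timeout=30):
    """Fetch the search page through a cloudscraper session (may still be challenged)."""
    # Imported here so the rest of the file loads even without the package.
    import cloudscraper  # third-party: pip install cloudscraper

    scraper = cloudscraper.create_scraper()  # behaves like requests.Session
    response = scraper.get(
        "https://www.indeed.com/jobs",
        params=build_params(query, location),
        timeout=timeout,
    )
    response.raise_for_status()
    return response.text


# Example call (performs a real network request):
# html = fetch_jobs_html("motorcycle mechanic", "New York, NY")
```

The returned HTML can then be passed to BeautifulSoup and searched the same way as the `soup` object above.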
While cloudscraper works most of the time, it might be better to use a paid service, such as ZenRows, to bypass the Cloudflare protection for https://www.indeed.com/jobs.