I'm trying to grab job titles from the search results on indeed.com using the requests module. Here is the link to the webpage from which I wish to fetch the job titles.

This is what I've tried so far:

import requests
from bs4 import BeautifulSoup

link = "https://www.indeed.com/jobs"
params={
    'q': 'motorcycle mechanic',
    'l': 'New York, NY'
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
}
def get_job_titles(url):
    res = requests.get(url, params=params, headers=headers)
    soup = BeautifulSoup(res.text, "lxml")
    titles = []
    for item in soup.select("#mosaic-jobResults td.resultContent h2 > a > span[id^='jobTitle']"):
        # The selector matches the <span> that holds the title; it has no href, so read its text
        titles.append(item.get_text(strip=True))
    return titles

if __name__ == '__main__':
    for title in get_job_titles(link):
        print(title)

When I run the script, I always get status 403. How can I get the job titles from that webpage using the requests module?

robots.txt

2 Answers

The page https://www.indeed.com/jobs is protected by Cloudflare. You can confirm this by inspecting the response to your request:

import requests

params={
    'q': 'motorcycle mechanic',
    'l': 'New York, NY'
}

http_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
}
response = requests.get('https://www.indeed.com/jobs', headers=http_headers, params=params, allow_redirects=True,
                   verify=True, timeout=30)

Output of print(response.headers):

Note the 'Server': 'cloudflare' in the output.

{'Date': 'Sat, 01 Apr 2023 18:42:55 GMT', 'Content-Type': 'text/html; charset=UTF-8', 
'Transfer-Encoding': 'chunked', 'Connection': 'close', 'Cross-Origin-Embedder-Policy': 
'require-corp', 'Cross-Origin-Opener-Policy': 'same-origin', 'Cross-Origin-Resource-Policy': 'same-origin', 'Permissions-Policy': 
'accelerometer=(),autoplay=(),camera=(),clipboard-read=(),clipboard-write=(),fullscreen=(),
geolocation=(),gyroscope=(),hid=(),interest-cohort=(),magnetometer=(),microphone=(),
payment=(),publickey-credentials-get=(),screen-wake-lock=(),serial=(),sync-xhr=(),
usb=()', 'Referrer-Policy': 'same-origin', 'X-Frame-Options': 'SAMEORIGIN', 'Cache-Control': 
'private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0', 
'Expires': 'Thu, 01 Jan 1970 00:00:01 GMT', 'Set-Cookie': '__cf_bm=afTsbfjJKeoN7yqVDI7bGjYTFhaF_QEDC9mCtkjT1Js-1680374575-0-AQJ5H4x6T28fONNVrM8Fh2nYeq6G8RB3+L/vxbSJwWTzIjPb0CeR/HO1AsKx9GRj6dLZz+ZHZ/Oc8om0NMQ+/YM=; 
path=/; expires=Sat, 01-Apr-23 19:12:55 GMT; domain=.indeed.com; HttpOnly; Secure; 
SameSite=None', 'Vary': 'Accept-Encoding', 'Server': 'cloudflare', 
'CF-RAY': '7b12f98b0847adb9-ATL', 'Content-Encoding': 'br', 'alt-svc': 'h3=":443"; 
ma=86400, h3-29=":443"; ma=86400'}

Snippets from the output of print(response.text):

These snippets indicate that the site is serving a Cloudflare challenge in response to your Python request.


 <span id="challenge-error-text">
                        Enable JavaScript and cookies to continue
                    </span>


trkjs.setAttribute('src', '/cdn-cgi/images/trace/managed/js/transparent.gif?ray=7b130075eea3ad6b');

 cpo.src = '/cdn-cgi/challenge-platform/h/b/orchestrate/managed/v1?ray=7b130075eea3ad6b';
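
The two checks above (the 'Server' header and the challenge markers in the body) can be combined into a small helper. This is only a sketch: the function name looks_like_cloudflare_challenge is hypothetical, and the marker strings are simply the ones visible in the snippets above, so they may change if Cloudflare changes its challenge page.

```python
def looks_like_cloudflare_challenge(headers, body):
    """Return True if a response looks like a Cloudflare challenge page.

    headers: a mapping of response header names to values (e.g. response.headers)
    body:    the response body as text (e.g. response.text)
    """
    # Header names are case-insensitive, so compare them in lowercase.
    server = ""
    for name, value in headers.items():
        if name.lower() == "server":
            server = value.lower()
    served_by_cloudflare = "cloudflare" in server

    # Marker strings taken from the response snippets shown in this answer.
    challenge_markers = ("challenge-error-text", "/cdn-cgi/challenge-platform/")
    has_marker = any(marker in body for marker in challenge_markers)

    return served_by_cloudflare and has_marker
```

With the request from earlier in this answer, you would call it as looks_like_cloudflare_challenge(response.headers, response.text).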

I would recommend using cloudscraper to scrape the site. I don't want to post the exact code that I used to bypass the Cloudflare protection for indeed.com, so the snippet below only shows how I parsed the result.

# Cloudflare(...).bypass() is my own helper (not shown), based on the link I provided.

soup = Cloudflare('https://www.indeed.com/jobs').bypass()
table_results = soup.find_all('td', {'class': 'resultContent'})
for item in table_results:
    link = item.find('span')
    print(link.attrs)
    # {'title': 'Auto Mechanic (Diesel)', 'id': 'jobTitle-9d7ba98aa6ce1036'}
    # {'title': 'Motorcycle Mechanic A,B OR C', 'id': 'jobTitle-a91f7c5e2d1c0a53'}
    # {'title': 'NEW VEHICLE SET UP MECHANIC', 'id': 'jobTitle-cbe3a30bbf3e415d'}
    # {'title': 'Motorcycle Mechanic', 'id': 'jobTitle-8736df00befc62ab'}
    # {'title': 'Mechanic', 'id': 'jobTitle-cf8a92124f5fe421'}

This site provides the basic details on how to use cloudscraper, which will allow you to bypass the CloudFlare protection.
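
For orientation, here is a minimal sketch of what a cloudscraper-based fetch might look like. It is not the code I used above. The function names fetch_job_titles and extract_job_titles are hypothetical, the title extraction relies on the span attributes shown in the output above ('title' plus an 'id' starting with 'jobTitle-'), and there is no guarantee that cloudscraper gets past the challenge on any given day. The parser uses only the standard library so it can be tried without extra packages.

```python
from html.parser import HTMLParser


class JobTitleParser(HTMLParser):
    """Collect the 'title' attribute of every <span id="jobTitle-...">."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        span_id = attributes.get("id") or ""
        if tag == "span" and span_id.startswith("jobTitle-") and attributes.get("title"):
            self.titles.append(attributes["title"])


def extract_job_titles(html):
    """Return the job titles found in a search-result page."""
    parser = JobTitleParser()
    parser.feed(html)
    return parser.titles


def fetch_job_titles(query, location):
    """Fetch the search page through cloudscraper and extract the titles."""
    # Third-party package (pip install cloudscraper); imported here so the
    # parsing code above can be used without it.
    import cloudscraper

    scraper = cloudscraper.create_scraper()  # behaves like a requests.Session
    response = scraper.get(
        "https://www.indeed.com/jobs",
        params={"q": query, "l": location},
        timeout=30,
    )
    response.raise_for_status()  # raises if the challenge still answers with 403
    return extract_job_titles(response.text)
```

Calling fetch_job_titles("motorcycle mechanic", "New York, NY") would then return a list of title strings if the challenge is passed.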

While cloudscraper works most of the time, it might be better to use a paid service, such as zenrows, to bypass the Cloudflare protection for https://www.indeed.com/jobs.

Life is complex

I hope I am wrong, but it does not seem possible to scrape the job titles without executing the JS that renders the respective HTML.

I used my browser's Dev Tools to check whether the website made a GET request for a file with the data you want, but there was no such request.

I used requests to send the GET request to the page's URL with the User-Agent header and all of the GET request's cookies. The response was an HTML file with the message "Enable JavaScript and cookies to continue". I also tried sending the request with all of the headers and then without the cookie header; both responses were a byte string.

Because of this, I think you will need to use selenium or playwright.
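
As a rough sketch of the browser route, assuming playwright is installed (pip install playwright, then playwright install chromium): the function names build_search_url and fetch_titles_with_browser are hypothetical, the selector is the one from the question, and I have not verified that a headless browser actually gets past the challenge, so treat this as a starting point only.

```python
from urllib.parse import quote, urlencode

# Selector taken from the question; it may need adjusting if the page changes.
JOB_TITLE_SELECTOR = "#mosaic-jobResults td.resultContent h2 > a > span[id^='jobTitle']"


def build_search_url(query, location):
    """Build the search URL, percent-encoding spaces and commas."""
    params = {"q": query, "l": location}
    return "https://www.indeed.com/jobs?" + urlencode(params, quote_via=quote)


def fetch_titles_with_browser(query, location, timeout_ms=30000):
    """Load the page in a real browser so its JavaScript runs, then read the titles."""
    # Third-party package; imported here so build_search_url works without it.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(build_search_url(query, location))
            # Wait until the job results exist; this times out (raises)
            # if the challenge page is shown instead.
            page.wait_for_selector(JOB_TITLE_SELECTOR, timeout=timeout_ms)
            return page.locator(JOB_TITLE_SELECTOR).all_text_contents()
        finally:
            browser.close()
```

If the headless browser is still challenged, launching with headless=False is the first thing I would try.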

The code I used for the requests attempt:

(I did not include my cookies, for privacy. You can get your own with your browser's Dev Tools.)

import requests
from bs4 import BeautifulSoup

def view_html(url, my_headers=None, my_cookies=None):
    res = requests.get(url, headers=my_headers, cookies=my_cookies)
    soup = BeautifulSoup(res.content,"html.parser")
    print(res.content)
    print()
    print(soup.prettify())

if __name__ == "__main__":

    h = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
        'Host': 'www.indeed.com',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Accept-Encoding': 'gzip, deflate, br',
        'DNT': '1',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'none',
        'Sec-Fetch-User': '?1',
        'TE': 'trailers'
    }

    url = "https://www.indeed.com/jobs?q=motorcycle%20mechanic&l=New%20York%2C%20NY"
    view_html(url, h)

Output I received when all the headers (including the cookies) were used:

<!DOCTYPE html>
<html lang="en-US">
 <head>
  <title>
   Just a moment...
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
  <meta content="noindex,nofollow" name="robots"/>
  <meta content="width=device-width,initial-scale=1" name="viewport"/>
  <link href="/cdn-cgi/styles/challenges.css" rel="stylesheet"/>
 </head>
 <body class="no-js">
  <div class="main-wrapper" role="main">
   <div class="main-content">
    <noscript>
     <div id="challenge-error-title">
      <div class="h2">
       <span class="icon-wrapper">
        <div class="heading-icon warning-icon">
        </div>
       </span>
       <span id="challenge-error-text">
        Enable JavaScript and cookies to continue
       </span>
      </div>
     </div>
    </noscript>
    <div id="trk_jschal_js" style="display:none;background-image:url('/cdn-cgi/images/trace/managed/nojs/transparent.gif?ray=7ae4ebc3addb024e')">
    </div>
    <form action="/jobs?q=motorcycle%20mechanic&amp;l=New%20York%2C%20NY&amp;__cf_chl_f_tk=zIdr3DHmAJEASpAwEGxbIEtQmt5wL_BF3Yqs2ajK6Tg-1679891666-0-gaNycGzNCxA" enctype="application/x-www-form-urlencoded" id="challenge-form" method="POST">
     <input name="md" type="hidden" value="I9Py8hlouSMytpiZY5RC15bVHtkglArAOJhDSLlZ0w4-1679891666-0-AUAyCZMpA1ufMJS2Xm-l7-qV8mKgUPs47V9ZDsPd8nkfsXw6IJb7GbEKMebi0J5s5-bQyKpn4REwwOz6pc8ZRFXo728tfb3-nR5KCNz0Za58ndbkYk8lebn2zmUH9xPaC44PaHol0bp2fRU8SsQIaJxzpcrfwkE7weOPwGNK25S15T1EU8RVHmQqpNXPVIyxz5yPbZxXOAPaAhCvX9sEoqtxBbx1W413ks4H3bTvC2I8eUKiySi_-8oXBLn14BGwgsa1BN8QIVFXabuesHhLRgpW5eX7gvm5eDAbLHSOUJ7pmGV8AAJhEzAnSKrsf6EY6fkxo8BlGVbqcZHkVbImgzhOUgdSxEHtnW4_DKYk4SMFfj3oPShf_jKvvioPhDbVjdPneiarW5yRS_Bmcy_nnyzPdabMo-8FjG-s_bgn00tzsEoLVYr2nDp_yJbM2HIbK_3p-PtLsPy9TJ9RMPzKrdHp6YpmsykKXZr2_lx_pLGEIbKuSOvblkuZ7j8GV9jGTMPfvdkdJUse_AOHaRCHZSGbDTCtLjC-wIfQQNXLWcufQjIVYTaLidsh3gCMkWceD37_lUSQDTp6udZhT0aUuwDOmYMVCV6eyLbPMfjp4i_pDGrXh6U2B55e8PIX_db_LGS2f1Nt1XhvdHzQObnnCbiNX8aMKL6IAQQekElS5QZCJCjjtu4NRE8aEjh2x8KXBFbYYjMOSc8zEga6RzB_0FtKMuDgguMWIoewCV-S7vdUh0MtCQKULKFu9x20DorhF-pSqyQMwCfSVSO2p1ARc6Dofwo8t8wINR86wMJoJSJonjctytcBzQID0oKPJLZJG4K__xnOgL2Lf0-1NoMhAsxFHGcJi2_Gljsv03jneOizIkwxo2mmI26A-nJ9cI5dK684XisuuG7PnBhSpLpoYZoJPvc9HkkCl4rJKKE6T2p0UT3wQ4KNdZwAgdyVVuQ5BElh9qG5Sq9eiBYvccrzejxa4pl04IFl8zmHYF6YFUOhjhC2P7tLBPBal_Le1Fvz2YHSv8gwial8Su03nMmaugTt3COeQT1wKUGaXff5n-KTYa9wA6p9l1QqupFW5XvkpeVR3BOLRvK2RN6os3IuAUY4wxyCY9IVLd3QO7AB-VkQazUVpgwdDvLVBVaHgZAakh_NEPX3KlS3s8_dTUqJ9yp4aJoCDuOLOdk4wnNrkMPa6nOeviDQcK5066i_yGD8Ege1ss_vJ5d583vuKnZ5J0Fi1JK8JOoJO8SVJ7oIZTLixZmgzVlrMFmaqP5P0Exb2pT--WCQQsHF9jyLq40_Mt5HWpMO5zCMZJK-tK96WfmxAsIdE7R46O1nkEkap2SXeGyBwT95xjixkzZAOguKJe_eUB5gDt9rKc65JLRL3Vv1pu6YSU-gFmw5VZEzHzsiQ_iOlybBKQhd37RXb3aBsxkxE5lOM8UjcHj1zRU3fy3SRS1OEHKkgCX_07qzH0-7rnCBoyaK4Ci-c88YYfDgx910k1RWOiPcsBW5-WNup528wYvBBPzEH3bbQcG5ZkMLPI63OWhM8ExAcXHjRSYpKyXdWMdpdTgPDNMa87SBr-TE87nNURVGAEwKKdtbYv0Mg6IzjSqKW-3RQokIRTHtHjWCatz3Vl1K-xHznrzAE2Y7yOzvs4hN79ZOQSMcEcCPvEnGHbSNImcWSm97Qc4g5mPXr2t7KqTlcDZxvEzEPMvC5p6Efa2qIrR4HuccRJ35gvUypTVuIHByqHcNzntourd6y_-ra5fwOhWKIpsRsQwKciocn0Q23SwdM3HpMKio-t5bWYEKBeAFvmRvNTMhrqyOljQxaSUxd5qTQfqP1k0v4B9ish-zghMSGP-jjsb7PUS2xLQcJDkU_SbJodCUEfi_frtq9X-jkTlJJrCDWkTr
-9ifk2kXB9dTa7wBrGsww5rDSFDuRAlvxLR9vV-csq-hHGcxoVnUojvMJikt-j0E8F_apdlgpRoYStDcGu992BJcHVu5n8SI-fK4rMsXAHEL-Dr5FCGowYeTCnLTBXAqnZv_NDzsIa6Y99mYZWY0u97sc9-UBWCRXBsy0-icc_9LI83BVpOdkUHQc62VbbEqjNRzMJU1GhaJMxGU6VxCg5lkQHR1p4Ssfgwe6U3JxW04SqlvWRR785kGtzhTzZ2mToNcG7h64-FVuHoHoWc864uIKqrK9By60W-aOI5soNEyI9TaFaeYIqvhkxIqkn56JETNmexeoOp7pPPx34PB5SnTb79E5Pa-fWvHxRwScJWXp-JOsvPbd3ws4LYw-tEzQEqUJ9MqMTm2FXJ44NX82ansO5tJyONdAxwr-8VWnYU3x6USelSwznYsIP3lSdq3x3KA71iMYyUxvhsVHo5qBR-4SWKtZvwKxW2inHD-LYRuRcwzKHvTiPA9qDaN-fqPwhOkiX4JpVed-_5gxc05xTVuJNa2rAdWWIbcbwIbwWYjcRqrf41FXa8sulx13qE8CksWF2SFBqtfQa9p5s89aa_NU_fwXKRawaXWa2js_n87ZE2wgetE6vk8_b9qx3kYOsZD4MgeLC_-idkcFvKy8Pk2FaYLVX8-rHNZ0l43DJa6blV4BaaLzi1QbR_Kl6XsTk0d1UgcU_zmY0caReNAx_sKY88zIU4B5CQR9JwhDm2lkzgmrxqOhcuHemA2-UB6Ti8wXRJRzi3yIxBuvtPL8ZvS2I9sAnFzQfJZzKpvTVoKkRSqsBHM8F0IcRXYcRIFuLvWPDX4NGeWOMjU3ALVy87ZVvtOvBoOQonwdZ4mrWQ"/>
    </form>
   </div>
  </div>
Übermensch