
I'm trying to create a script in Python using the requests module to scrape the titles of different jobs from a website. To parse the job titles I first need to get the relevant response from that site so that I can process the content using BeautifulSoup. However, when I run the following script, it produces gibberish that doesn't contain the titles I'm looking for.

website link (In case you don't see any data, make sure to refresh the page)

I've tried with:

import requests
from bs4 import BeautifulSoup

link = 'https://www.alljobs.co.il/SearchResultsGuest.aspx?'

query_string = {
    'page': '1',
    'position': '235',
    'type': '',
    'city': '',
    'region': ''
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36'
    s.headers.update({"Referer": "https://www.alljobs.co.il/SearchResultsGuest.aspx?page=2&position=235&type=&city=&region="})
    res = s.get(link, params=query_string)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select(".job-content-top [class^='job-content-top-title'] a[title]"):
        print(item.text)

I also tried it like this:

import urllib.request
from bs4 import BeautifulSoup
from urllib.parse import urlencode

link = 'https://www.alljobs.co.il/SearchResultsGuest.aspx?'

query_string = {
    'page': '1',
    'position': '235',
    'type': '',
    'city': '',
    'region': ''
}

headers={
    "User-Agent":"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36",
    "Referer":"https://www.alljobs.co.il/SearchResultsGuest.aspx?page=2&position=235&type=&city=&region="  
}

def get_content(url, params):
    req = urllib.request.Request(f"{url}{params}", headers=headers)
    res = urllib.request.urlopen(req).read()
    soup = BeautifulSoup(res, "lxml")
    for item in soup.select(".job-content-top [class^='job-content-top-title'] a[title]"):
        yield item.text

if __name__ == '__main__':
    params = urlencode(query_string)
    for item in get_content(link, params):
        print(item)

How can I fetch the title of different jobs using requests?

PS: Using a browser simulator is not an option here for this task.

asmitu
  • It's not gibberish, but JS code, minified and obfuscated. I'm afraid you can't avoid Selenium, unless you can reverse engineer that JS code. – t.m.adam Mar 04 '20 at 20:23
  • It seems lemonlin and Jack Fleeting managed to get the final html with your code. I would ask them for more details, maybe this is related to OS or IP location. – t.m.adam Mar 04 '20 at 20:26

2 Answers

5

I'd like to see what your gibberish looks like. When I ran your code, I got a bunch of Hebrew characters (unsurprising, since the website is in Hebrew) and job titles:

לחברת הייטק מובילה, IT project manager דרושים AllStars-IT Group (MT) אלעד מערכות מגייסת מפתח /ת JAVA לגוף רפואי גדול היושב בתל אביב! דרושים אלעד מערכות מנתח /ת מערכות ומאפיין /ת דרושים מרטנס הופמן שירותי מחשוב אנשי /נשות תפעול ותמיכה טכנית למוצר אינטרנטי דרושים המימד השלישי DBA SQL /ORACLE דרושים CPS Jobs דרושים /ות אנשי /נשות תמיכה על מערכת פריוריטי, שכר מתגמל למתאימים /ות דרושים חבר הון אנושי מפתח /ת SAP ABAP דרושים טאואר סמיקונדקטור דרוש /ה Director of Data analytics דרושים אופיסופט Fullstack Developer דרושים SQLink מפתח /ת תשתיות דאטה ותומך תשתית BI דרושים המימד השביעי בע"מ מפתח /ת תשתיות דאטה ותומך /ת תשתית BI דרושים יוניטסק לארגון בעל משמעות גבוהה דרוש /ה תוכניתן /ית ABAP דרושים יוניטסק לחברת טלדור דרוש /ה ארכיטקט /ית למערכת פיקוד ובקרה עבור ארגון גדול בתל אביב דרושים טלדור Taldor מערכות מחשבים דרוש /ה מפתח /ת אינטגרציה דרושים SQLink דרוש /ה ראש צוות Full stack מתכנת /ת Senior Software Engineer Manager Senior Software Engineer Senior Embedded Software Engineer Embedded Software Engineer Senior Software Engineer Subsidiary PMM Manager תוכניתן /ית BackEnd Full Stack /Frontend Software Engineer Software Validation Engineer Principal Product Manager Quantum Algorithms Research intern Principal/Senior Detection Team Lead Support Engineer Software Engineer

Is your problem that you want to filter out the Hebrew characters? That just requires a simple regex. Import the re package, then replace your print statement with this (use `A-Za-z` rather than the `A-z` range, which is a subtle bug because it also matches characters like `[`, `_` and backtick):

print(re.sub('[^A-Za-z0-9]+', ' ', item.text))
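
For context, a minimal sketch of that filter dropped into the loop from the question (assuming the same soup object):

import re

for item in soup.select(".job-content-top [class^='job-content-top-title'] a[title]"):
    # Collapse every run of non-alphanumeric characters into a single space
    print(re.sub('[^A-Za-z0-9]+', ' ', item.text))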

Hope this helps!

lemonlin
  • Did you run OPs code exactly as is? When I tried that, I got none of what you have; just one long ` – Jack Fleeting Mar 04 '20 at 17:56
  • Check out the edit above. I've included the content in my post. Make sure to select `view page source` to see what really are there in the content. Thanks. – asmitu Mar 04 '20 at 17:57
  • @JackFleeting I copy/pasted into an .ipynb in google colab and ran the chunk, no edits, and this is what I got. It ran pretty perfectly for me. – lemonlin Mar 04 '20 at 18:19
  • Very, very strange. I ran exactly the same thing in a notebook but on my own computer (Win 10), and received exactly what @asmitu has in his gibberish file. It must be a character encoding issue, I guess. Never seen this before... – Jack Fleeting Mar 04 '20 at 18:32
  • If it's an encoding problem, then maybe mess around with adding `res.encoding = res.apparent_encoding` right before BeautifulSoup (see the sketch after these comments)? I can't run it and see if it helps since I'm not getting the error (also Win 10). ref: https://stackoverflow.com/questions/44203397/python-requests-get-returns-improperly-decoded-text-instead-of-utf-8 – lemonlin Mar 05 '20 at 01:48
  • Most likely, this happens because [Google Colab](https://colab.research.google.com/) enjoys an exemption from https://www.alljobs.co.il that allows the site to be indexable by Google. – Alex Cohn Sep 24 '20 at 13:18
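
In case it's useful, here's a minimal sketch of where that suggested encoding fix would slot into the question's first script (assuming the rest of the script stays as posted):

import requests
from bs4 import BeautifulSoup

link = 'https://www.alljobs.co.il/SearchResultsGuest.aspx?'
query_string = {'page': '1', 'position': '235', 'type': '', 'city': '', 'region': ''}

with requests.Session() as s:
    res = s.get(link, params=query_string)
    # Re-guess the encoding from the response body before parsing
    res.encoding = res.apparent_encoding
    soup = BeautifulSoup(res.text, "lxml")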
4

To successfully get the expected response, you have to send cookies. For this URL, the rbzid cookie alone is enough. You can get it manually; when it expires, you can implement a solution using Selenium and a proxy server to refresh it (a sketch follows the snippet below) and then continue scraping with requests.

import requests
from bs4 import BeautifulSoup

user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) ' \
             'Chrome/80.0.3987.122 Safari/537.36'
# rbzid is the site's anti-bot cookie; this hard-coded value will eventually expire
cookies = {
    'rbzid': 'DGF6ckG9dPQkJ0RhPIIqCu2toGvky84UY2z7QpJln31JVdw/YU4wJ7WXe5Tom9VhEvsZT6PikTaeZjJfsKwp'
             'M1TaCZr6tOHaOtE8jX3eWsFX5Zm8TJLeO8+O2fFfTHBf++lRgo/NaYq/sXh+QobO59zQRmZQd0XMjTSpVMDu'
             'YZS8C3GMsIR8cBt9gyuDCYD2XL8pVz68fD4OqBep3G/LnKR4bQsMiLHwKjglQ4fBrq8=',
}
headers = {'User-Agent': user_agent}
params = (
    ('page', '1'),
    ('position', '235'),
    ('type', ''),
    ('city', ''),
    ('region', ''),
)

response = requests.get('https://www.alljobs.co.il/SearchResultsGuest.aspx',
                        headers=headers, params=params, cookies=cookies)

soup = BeautifulSoup(response.text, "lxml")
titles = soup.select("a[title]")
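
And here's a minimal sketch of the refresh step mentioned above, using plain Selenium without a proxy server; it assumes chromedriver is available on PATH, and the helper name fresh_rbzid is just an illustrative choice:

from selenium import webdriver

def fresh_rbzid(url='https://www.alljobs.co.il/'):
    # Hypothetical helper: let a real browser pass the site's JS
    # challenge, then read the rbzid cookie it was granted
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        for cookie in driver.get_cookies():
            if cookie['name'] == 'rbzid':
                return cookie['value']
    finally:
        driver.quit()

cookies = {'rbzid': fresh_rbzid()}

The refreshed cookies dict can then be passed to requests.get exactly as above.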
Sers
  • Your approach leads in the right direction. Isn't it possible to fetch the cookie using requests as well? Thanks. – asmitu Mar 05 '20 at 18:19
  • The response you got and shared is JavaScript code, so you can try to do something with that. But I think the easiest solution is to use a proxy server and Selenium to catch the required headers. You only need to use it when a header update is required. – Sers Mar 06 '20 at 21:15
  • @asmitu to overcome the issue related to `hard-coded cookies`, you can use a tool called **PyChromeDevTools**. I have implemented a method to automatically get cookies from any URL. Visit: https://stackoverflow.com/questions/59045550/cant-parse-the-username-to-make-sure-im-logged-in-to-a-website/59196651#59196651 – Muhammad Usman Bashir Mar 11 '20 at 07:51