
I have a very simple problem. I'm trying to get the job description from the HTML of a LinkedIn page, but instead of the page's HTML I get a few lines that look like JavaScript code. I'm very new to this, so any help will be greatly appreciated! Thanks

Here's my code:

import requests
url = "https://www.linkedin.com/jobs/view/inside-sales-manager-at-stericycle-1089095836/"
page_html = requests.get(url).text
print(page_html)

When I run this, I don't get the HTML I expect, containing the job description. I just get a few lines of JavaScript code instead.

Mureinik
Chadee Fouad
  • Seems to work for me. Can you share the output you're getting? – Mureinik Jan 28 '19 at 06:03
  • The JS code that is returned is actually a function bound to the window.onload event. I suspect that the rest of the page is loaded using client-side code, so you need to execute that first to retrieve the produced HTML. A possible solution can be found [here](https://stackoverflow.com/questions/29996001/using-python-requests-get-to-parse-html-code-that-does-not-load-at-once) – s.feradov Jan 28 '19 at 06:21
  • Possible duplicate of [Using Python requests.get to parse html code that does not load at once](https://stackoverflow.com/questions/29996001/using-python-requests-get-to-parse-html-code-that-does-not-load-at-once) – bruno desthuilliers Jan 28 '19 at 12:17
  • Thank you sir I appreciate your response...what you said makes a lot of sense and I think you're right. I finally got it working with Selenium...it's a much easier solution than Beautiful Soup. Have a great day! :-) – Chadee Fouad Jan 29 '19 at 19:13
  • Anwarvic is completely right. Would just like to add that webdriver_path needs to be defined or you'll get a "can't find chromedriver" error. – Alciore Jun 29 '20 at 15:04

2 Answers

7

Some websites present different content depending on the type of client that accesses them, and LinkedIn is a perfect example of such behavior. If the browser has advanced capabilities, the website may present “richer” content: something more dynamic and styled. A plain HTTP client such as requests does not run JavaScript, so it never sees that content.
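To check whether a response is one of these script-only pages before reaching for a real browser, something like the sketch below may help. It uses only the standard library and counts the text that sits outside script and style elements. The names looks_like_js_shell and VisibleTextCounter and the 200-character threshold are my own assumptions for illustration, not anything LinkedIn or requests defines.

```python
from html.parser import HTMLParser


class VisibleTextCounter(HTMLParser):
    """Count characters of text outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0      # > 0 while inside <script> or <style>
        self.visible_chars = 0   # running total of visible text length

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Only count text that a reader would actually see on the page
        if not self.skip_depth:
            self.visible_chars += len(data.strip())


def looks_like_js_shell(html, min_visible_chars=200):
    """Return True if the page has almost no text outside scripts/styles.

    min_visible_chars is an arbitrary threshold chosen for this sketch.
    """
    counter = VisibleTextCounter()
    counter.feed(html)
    counter.close()
    return counter.visible_chars < min_visible_chars
```

Calling looks_like_js_shell(requests.get(url).text) would then give a rough hint that the real content is produced by client-side code and needs a JavaScript-capable browser.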

To solve this problem, you need to follow these steps:

  1. Download ChromeDriver from here. Choose the build that matches your OS and your installed Chrome version.
  2. Extract the driver and put it in a directory of your choice, for example /usr.
  3. Install Selenium, which is a Python package, by running pip install selenium. pip normally pulls in Selenium's dependencies by itself; if the installation complains about a missing msgpack package, install it first with pip install msgpack.
  4. Now we are ready to run the following code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


def create_browser(webdriver_path):
    # Create a Selenium browser object that mimics a real browser
    browser_options = Options()
    # The headless flag runs the browser without a visible window
    browser_options.add_argument("--headless")
    browser_options.add_argument("--no-sandbox")
    # Selenium 3 style: the driver path is the first positional argument
    browser = webdriver.Chrome(webdriver_path, options=browser_options)
    print("Done Creating Browser")
    return browser


url = "https://www.linkedin.com/jobs/view/inside-sales-manager-at-stericycle-1089095836/"
browser = create_browser('/usr/chromedriver')  # DON'T FORGET TO CHANGE THIS TO YOUR OWN PATH
browser.get(url)
page_html = browser.page_source
browser.quit()  # close the browser once the HTML has been captured
print(page_html[-10:])  # prints dy></html>

Now, you have the whole page. I hope this answers your question!!
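To go from the rendered page to the job description itself, the HTML still has to be parsed. Below is a minimal standard-library sketch that collects the text of the first element carrying a given CSS class. The class name "description" and the helper names are hypothetical; inspect the real page in your browser's developer tools to find the element that actually holds the description.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect the text inside the first element carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0       # nesting depth inside the target (0 = outside)
        self.done = False    # True once the target element has been closed
        self.chunks = []     # text fragments found inside the target

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS or self.done:
            return
        if self.depth:
            self.depth += 1
        elif self.css_class in (dict(attrs).get("class") or "").split():
            self.depth = 1   # entered the target element

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.done:
            return
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def extract_text_by_class(html, css_class):
    """Return the space-joined text of the first element with css_class."""
    parser = ClassTextExtractor(css_class)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

With the code above you would call extract_text_by_class(page_html, "description"), substituting the real class name. The depth counting assumes every non-void element is explicitly closed, which should hold for the HTML a browser serializes; a library such as BeautifulSoup would do the same job in fewer lines and cope better with messy markup.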

Anwarvic
  • Thanks Anwar for the detailed reply...very much appreciated :-)...yes that works! Selenium is a much better and easier tool – Chadee Fouad Jan 29 '19 at 20:01
0

Selenium 4 brought several major changes: the driver path can no longer be passed directly as a parameter and goes through a Service object instead, and the path to the browser itself can be set on the options (on Windows it is usually C:/Program Files (x86)/Google/Chrome/Application/chrome.exe). The previous answer can be updated as follows to avoid problems with Selenium 4+:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service


def create_browser(webdriver_path, browser_path):
    # Create the Selenium options and point the service at the driver executable
    browser_options = Options()
    browser_service = Service(webdriver_path)

    # The headless flag runs the browser without a visible window
    browser_options.add_argument("--headless")
    browser_options.add_argument("--no-sandbox")
    # Path to the real browser executable
    browser_options.binary_location = browser_path

    # Launch the browser with the given settings
    browser = webdriver.Chrome(service=browser_service, options=browser_options)
    print("Done Creating Browser")
    return browser


url = "https://www.linkedin.com/jobs/view/inside-sales-manager-at-stericycle-1089095836/"
browser = create_browser('/some_path/chromedriver.exe', '/some_other_path/chrome.exe') 
browser.get(url)
page_html = browser.page_source
print(page_html[-10:]) #prints dy></html>

I recommend that new readers check the current Selenium documentation, because the library changes significantly every 2-3 years, which can quickly make the answers here obsolete.