
Bottom line up front: I want to scrape the jobs from this website: https://www.gdit.com/careers/search/?q=bossier%20city, but I keep getting the JavaScript base page. If you inspect the page, you can see the jobs are listed in h3 tags, but no matter what I do, the jobs don't come up.

  1. I tried the following Beautiful Soup code:

import requests
from bs4 import BeautifulSoup

url = "https://www.gdit.com/careers/search/?q=bossier%20city"
html_text = requests.get(url).text
soup = BeautifulSoup(html_text, "html.parser")
print(soup)  # for testing purposes
for job in soup.find_all('h3'):
    print(job)
  2. I tried ScraperAPI, which I thought was supposed to load JavaScript for you:

import requests

url = "https://www.gdit.com/careers/search/?q=bossier%20city"
params = {'api_key': "MY-API-KEY", 'url': url}
response = requests.get('http://api.scraperapi.com/', params=params)
print(response.text)  # No h3 tags of any kind
  3. I tried requests-html:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://www.gdit.com/careers/search/?q=bossier%20city")
r.html.render()  # render() returns None, so print the rendered html instead
print(r.html.html)
  4. I tried Selenium first and then parsing it with Beautiful Soup:

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common import exceptions

url = "https://www.gdit.com/careers/search/?q=bossier%20city"
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("detach", True)
options.add_experimental_option('useAutomationExtension', False)
try:
    driver = webdriver.Chrome(options=options, executable_path=r'C:\Users\Notebook\Documents\chromedriver.exe')
    driver.get(url)
    time.sleep(2)  # wait before grabbing the page source
    page_source = driver.page_source
    soup = BeautifulSoup(page_source, "html.parser")
    print(soup)
except exceptions.WebDriverException:
    print("You need to download a new version of the Chromedriver.")

Nothing works. Do I have to mimic a user entering Bossier City first and then retrieve the result? Anyway, any help would be appreciated.

Brandon Jacobson

2 Answers


I would suggest switching from BeautifulSoup (a static parser, purely Python-based) to Selenium (a dynamic loader that integrates with multiple web browsers such as Chrome and Firefox).


Selenium is used for automation testing on websites, but it can also be used to scrape advanced dynamic websites.

It provides many features, from reading DOM values to adding, removing, or editing DOM elements, and you can also wait for an element to come into existence by waiting for it to appear or render.

driver.page_source only contains the base HTML as loaded so far, not what the dynamic JavaScript fills in later. If you just print(driver.page_source) you will see what data is available; adding time.sleep(10) before reading it gives the scripts time to finish.

Dean Van Greunen

I think your problem is simple. As you said, this page loads its elements dynamically using JS.

Selenium simply waits for the HTML to load; it does not wait for any scripts to finish running.

In order to wait for a specific element, all you have to do is add this functionality to your code (Selenium supports this). Here's a great post explaining this. That post explains how you can wait for a specific element to become interactable, which is one step further than, what I'm guessing, you require.

Nizar