
Bottom line up front: I want to scrape the jobs from this website: https://www.gdit.com/careers/search/?q=bossier%20city, but I keep getting the JavaScript base page. If you inspect the page, you can see the jobs are listed in h3 tags, but no matter what I do, the jobs don't come up.

  1. I tried the following Beautiful Soup code:

import requests
from bs4 import BeautifulSoup

url = "https://www.gdit.com/careers/search/?q=bossier%20city"
html_text = requests.get(url).text
soup = BeautifulSoup(html_text, "html.parser")
print(soup)  # for testing purposes
for job in soup.find_all('h3'):
    print(job)
  2. I tried ScraperAPI, which I thought was supposed to load JavaScript for you:

import requests

url = "https://www.gdit.com/careers/search/?q=bossier%20city"
params = {'api_key': "MY-API-KEY", 'url': url}
response = requests.get('http://api.scraperapi.com/', params=params)
print(response.text)  # No h3 tags of any kind
  3. I tried requests-html:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://www.gdit.com/careers/search/?q=bossier%20city")
r.html.render()  # render() returns None, so print the rendered html instead
print(r.html.html)
  4. I tried Selenium first and then parsing it with Beautiful Soup:

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common import exceptions

url = "https://www.gdit.com/careers/search/?q=bossier%20city"
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("detach", True)
options.add_experimental_option('useAutomationExtension', False)
try:
    driver = webdriver.Chrome(options=options, executable_path=r'C:\Users\Notebook\Documents\chromedriver.exe')
    driver.get(url)
    time.sleep(2)  # wait before grabbing the page source
    page_source = driver.page_source
    soup = BeautifulSoup(page_source, "html.parser")
    print(soup)
except exceptions.WebDriverException:
    print("You need to download a new version of the Chromedriver.")

Nothing works. Do I have to mimic a user entering Bossier City first and then retrieve the result? Anyway, any help would be appreciated.

Brandon Jacobson

2 Answers


I would suggest switching from BeautifulSoup (a static parser, purely Python-based) to Selenium (a dynamic loader that integrates with multiple web browsers such as Chrome and Firefox).


Selenium is used for automation testing on websites, but it can also be used to scrape advanced dynamic websites.

It provides many features, from reading DOM values to adding, removing, or editing DOM elements, and you can also wait for an element to come into existence by waiting for it to appear or render.

driver.page_source only contains the base HTML as loaded so far, not what the dynamic JavaScript fills in later. If you just print(driver.page_source) you will see what data is available; adding time.sleep(10) before reading it gives the scripts time to finish.

Dean Van Greunen

I think your problem is simple. As you said, this page loads its elements dynamically using JS.

Selenium simply waits for the HTML to load; it does not wait for any scripts to finish running.

In order to wait for a specific element, all you have to do is add this functionality to your code (Selenium supports this). Here's a great post explaining this. That post explains how you can wait for a specific element to become interactable, which is one step further than, what I'm guessing, you require.

Nizar