How to get missing HTML data when web scraping with python-requests

Question

I am building a job board, which involves scraping job data from company sites. I am currently trying to scrape Twilio at https://www.twilio.com/company/jobs. However, I am not getting the job data itself; the scraper seems to miss it. Based on other questions, this could be because the data is loaded by JavaScript, but that is not obvious.

Here is the code I am using:

import requests
from bs4 import BeautifulSoup

# Set the URL you want to webscrape from
url = 'https://www.twilio.com/company/jobs'

# Connect to the URL
response = requests.get(url)

if "_job-title" in response.text:
    print("Found the jobs!")    # FAILS: never printed

# Parse HTML and save to BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")

# Loop through all 'a' tags (links) that have the class "_job"
for one_a_tag in soup.find_all('a', class_='_job'):
    link = one_a_tag['href']
    print(link)            # FAILS: no matching tags, so nothing is printed

Nothing is displayed when this code is run. I have tried urllib2 as well, and it has the same problem. Selenium works, but it is too slow for the job. Scrapy looks promising, but I am having install issues with it.

Here is a screenshot of the data I am trying to access:

Did you try disabling js in the browser, then re-loading the page and seeing if the content is there? — QHarr, Oct 05 '19 at 18:34
Next, re-enable js and open network tab of browser and press Ctrl + F to open search box. Press F5 to refresh page. After page has loaded enter a unique value (if possible) into search box and hit enter - use this method to see if you can find any calls the page is making to get that content that you can call yourself with requests. — QHarr, Oct 05 '19 at 18:49
Interesting. Under network I searched for a unique job and found that there is a name "jobs" that comes up with what I searched for. Its domain is www.twilio.com though, which isn't helpful. — ryankuck, Oct 05 '19 at 19:12
Examine the whole thing - is it a POST request with a body shown? — QHarr, Oct 05 '19 at 19:14
What comes up is an html file that contains all the data. It is what I get when I just inspect element on the page. I added a screenshot of what it looks like above. — ryankuck, Oct 05 '19 at 19:19
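
The search described in the comments above can also be run outside the browser. Most browsers can export the Network tab as a HAR file, which is JSON. The following is a minimal sketch, assuming the HAR 1.2 layout (`log.entries[].request.url` and `log.entries[].response.content.text`); the function name `find_requests_containing` and the sample data are hypothetical and only illustrate the idea.

```python
import base64
import json


def find_requests_containing(har, term):
    """Return URLs of recorded requests whose response body contains term."""
    matches = []
    for entry in har["log"]["entries"]:
        content = entry["response"].get("content", {})
        body = content.get("text") or ""
        # HAR marks base64-encoded bodies with an "encoding" field.
        if content.get("encoding") == "base64":
            body = base64.b64decode(body).decode("utf-8", errors="replace")
        if term in body:
            matches.append(entry["request"]["url"])
    return matches


# Tiny hand-made HAR fragment standing in for a real export (hypothetical data).
sample_har = {
    "log": {
        "entries": [
            {
                "request": {"url": "https://example.com/page"},
                "response": {"content": {"text": "<html>no jobs here</html>"}},
            },
            {
                "request": {"url": "https://example.com/api/jobs"},
                "response": {
                    "content": {"text": json.dumps({"title": "Staff Widget Engineer"})}
                },
            },
        ]
    }
}

print(find_requests_containing(sample_har, "Staff Widget Engineer"))
```

With a real export, the dictionary would come from `json.load` on the saved file, and the search term would be a unique value copied from the rendered page.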

QHarr · Accepted Answer · 2019-10-05T19:54:35.967

Basic info for all the jobs at the different offices is loaded dynamically from an API call, which you can find in the network tab. If you extract the ids from that response, you can then make separate requests for the detailed job info using those ids. For example:

import requests
from bs4 import BeautifulSoup as bs

listings = {}

with requests.Session() as s:
    r = s.get('https://api.greenhouse.io/v1/boards/twilio/offices').json()
    for office in r['offices']:
        for dept in office['departments']: # you could perform some filtering here or later on
            if 'jobs' in dept:
                for job in dept['jobs']:
                    listings[job['id']] = job  # store basic job info in dict, keyed by job id
    for key in listings:
        r = s.get(f'https://boards.greenhouse.io/twilio/jobs/{key}')
        soup = bs(r.content, 'lxml')
        listings[key]['soup'] = soup # store soup from detail page on the matching job
        print(soup.select_one('.app-title').text) # example: print something from the page
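
The nested loop in the answer can be tried without any network access. The sketch below assumes only the offices -> departments -> jobs nesting that the answer's code walks; the helper name `collect_jobs` and the sample payload are made up for illustration and are not real Twilio data.

```python
def collect_jobs(payload):
    """Flatten an offices -> departments -> jobs payload into {job_id: job}."""
    jobs_by_id = {}
    for office in payload.get("offices", []):
        for dept in office.get("departments", []):
            # Departments without a "jobs" key are skipped, as in the answer.
            for job in dept.get("jobs", []):
                jobs_by_id[job["id"]] = job
    return jobs_by_id


# Hypothetical payload with the same nesting the answer's loop expects.
sample = {
    "offices": [
        {"name": "Office A", "departments": [
            {"name": "Engineering", "jobs": [{"id": 101, "title": "Engineer"}]},
            {"name": "Empty department"},
        ]},
        {"name": "Office B", "departments": [
            {"name": "Engineering", "jobs": [{"id": 101, "title": "Engineer"},
                                             {"id": 202, "title": "Designer"}]},
        ]},
    ]
}

jobs = collect_jobs(sample)
print(sorted(jobs))
```

Keying the dictionary by job id means a job listed under more than one office is stored only once, so each detail page would be requested a single time.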

edited Oct 05 '19 at 19:54

answered Oct 05 '19 at 19:39

QHarr

Awesome, could you explain your method for finding the api call one more time though? – ryankuck Oct 05 '19 at 19:49
See [1](https://stackoverflow.com/a/56279841/6241235) and [2](https://stackoverflow.com/a/56924071/6241235) – QHarr Oct 05 '19 at 19:53
What key words would you recommend searching for? I used job titles and those did not come up with js files. Also will the api call always have 'api' in it -- is 'api' a good search term to use? – ryankuck Oct 05 '19 at 23:51
Also will the api call be in a js file for sure, because I have gone through them all on this site for instance: https://careers.airbnb.com/#jobs and not found the call. – ryankuck Oct 06 '19 at 00:11
You want your search term to be unique if possible, i.e. something that should only appear visibly on the page in the section of interest and be unlikely to occur elsewhere in the sources. Short numbers are not usually good. I started by clicking on a job link to see how the detailed info was being retrieved, and saw that a url appeared to be generated with an id on the end. I then searched for that id in the web traffic. – QHarr Oct 06 '19 at 04:45
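
The id mentioned in the last comment can be pulled out of a job link programmatically before searching the traffic for it. This is a rough sketch, assuming the link ends in `/jobs/<digits>` as in the answer's detail URL; the helper name `job_id_from_url` and the id `1234567` are hypothetical.

```python
import re
from urllib.parse import urlparse


def job_id_from_url(url):
    """Return the trailing numeric id of a .../jobs/<id> link, or None."""
    # urlparse drops the query string and fragment from the path.
    path = urlparse(url).path.rstrip("/")
    match = re.search(r"/jobs/(\d+)$", path)
    return match.group(1) if match else None


print(job_id_from_url("https://boards.greenhouse.io/twilio/jobs/1234567"))
print(job_id_from_url("https://www.twilio.com/company/jobs"))
```

The first call returns the id as a string and the second returns None, because the listing page has no id on the end of its path.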
