
I am writing a function to print a list of links so I can add them to a list of companies and job titles. However, I am having difficulty navigating tag sub-contents. I want to list every 'href' inside an 'a' inside a 'div', like so:

from bs4 import BeautifulSoup
import re
import pandas as pd
import requests


page = "https://www.indeed.com/q-software-developer-l-San-Francisco-jobs.html"
headers = {'User-Agent':'Mozilla/5.0'}

def get_soup():
    session = requests.Session()
    pageTree = session.get(page, headers=headers)
    return BeautifulSoup(pageTree.content, 'html.parser')

pageSoup = get_soup()

def print_links():
    """This function scrapes the job title links."""
    jobLink = [div.a for div in pageSoup.find_all('div', class_='title') if div.a]  # skip divs without an anchor
    for link in jobLink:
        print(link['href'])

I am trying to make a list, but my result is just raw text and does not seem to be a usable link, like so:

/pagead/clk?mo=r&ad=-6NYlbfkN0DhVAxkc_TxySVbUOs6bxWYWOfhmDTNcVTjFFBAY1FXZ2RjSBnfHw4gS8ZdlOOq-xx2DHOyKEivyG9C4fWOSDdPgVbQFdESBaF5zEV59bYpeWJ9R8nSuJEszmv8ERYVwxWiRnVrVe6sJXmDYTevCgexdm0WsnEsGomjLSDeJsGsHFLAkovPur-rE7pCorqQMUeSz8p08N_WY8kARDzUa4tPOVSr0rQf5czrxiJ9OU0pwQBfCHLDDGoyUdvhtXy8RlOH7lu3WEU71VtjxbT1vPHPbOZ1DdjkMhhhxq_DptjQdUk_QKcge3Ao7S3VVmPrvkpK0uFlA0tm3f4AuVawEAp4cOUH6jfWSBiGH7G66-bi8UHYIQm1UIiCU48Yd_pe24hfwv5Hc4Gj9QRAAr8ZBytYGa5U8z-2hrv2GaHe8I0wWBaFn_m_J10ikxFbh6splYGOOTfKnoLyt2LcUis-kRGecfvtGd1b8hWz7-xYrYkbvs5fdUJP_hDAFGIdnZHVJUitlhjgKyYDIDMJ-QL4aPUA-QPu-KTB3EKdHqCgQUWvQud4JC2Fd8VXDKig6mQcmHhZEed-6qjx5PYoSifi5wtRDyoSpkkBx39UO3F918tybwIbYQ2TSmgCHzGm32J4Ny7zPt8MPxowRw==&p=0&fvj=1&vjs=3
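
That output is actually the href attribute itself: a relative ad-redirect path rather than a full link. A minimal sketch of turning it into an absolute URL with the standard library (the truncated href below stands in for the real one above):

from urllib.parse import urljoin

base = "https://www.indeed.com"
relative_href = "/pagead/clk?mo=r&ad=..."  # stand-in for the real href
print(urljoin(base, relative_href))
# -> https://www.indeed.com/pagead/clk?mo=r&ad=...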

Additionally, here is my attempt at making a list with the links:

def get_job_titles():
    """this function scrapes the job titles"""
    jobs = []
    jobTitle = pageSoup.find_all('div', class_='title')
    for span in jobTitle:
        link = span.find('href')
        if link:
            jobs.append({'title':link.text,
                          'href':link.attrs['href']})
        else:
            jobs.append({'title':span.text, 'href':None})
    return jobs
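
Note that span.find('href') here searches for a tag named 'href', which never matches, so link is always None. A corrected sketch of the same idea, assuming the same 'div.title' > 'a' structure as above:

def get_job_titles():
    """Scrape job titles and links (sketch; assumes an <a> inside each div.title)."""
    jobs = []
    for div in pageSoup.find_all('div', class_='title'):
        a = div.find('a')  # the anchor tag, not a tag named 'href'
        if a and a.has_attr('href'):
            jobs.append({'title': a.get_text(strip=True), 'href': a['href']})
        else:
            jobs.append({'title': div.get_text(strip=True), 'href': None})
    return jobs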

1 Answer


I would regex the required info out of the returned HTML and construct each URL from the parameters the page's JavaScript uses to build the links dynamically. Interestingly, the total number of listings differs when using requests versus a browser. You can enter the number of listings manually, e.g. 6175 (the current count), or use the number returned by the request (which is lower, so you miss some results); you could also use selenium to get the correct initial result count. You can then issue requests with offsets to retrieve all listings.

The ordering of listings can be randomized between requests, so the same job may appear on more than one page; de-duplicating on the job key (jk) handles that, as sketched below.
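
A minimal sketch of that de-duplication, assuming each collected row also stores the jk value parsed in the scripts below:

def dedupe_by_jk(rows):
    """Keep the first occurrence of each job key ('jk').

    rows: iterable of dicts such as
    {'jk': ..., 'title': ..., 'company': ..., 'url': ...}
    """
    seen = {}
    for row in rows:
        seen.setdefault(row['jk'], row)  # first occurrence wins
    return list(seen.values())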

It seems you can pass a limit parameter to raise the results per page up to 50, e.g.

https://www.indeed.com/jobs?q=software+developer&l=San+Francisco&limit=50&start=0
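
A sketch of making that request with requests and explicit query parameters (the User-Agent mirrors the one in the question; limit and start are the knobs for paging):

import requests

params = {
    'q': 'software developer',
    'l': 'San Francisco',
    'limit': 50,  # results per page, reportedly accepted up to 50
    'start': 0,   # offset; advance by `limit` for each subsequent page
}
r = requests.get('https://www.indeed.com/jobs', params=params,
                 headers={'User-Agent': 'Mozilla/5.0'})
print(r.status_code, len(r.text))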

Furthermore, it seems possible to retrieve more results than the total count actually shown on the webpage.

Python, with 10 results per page:

import requests, re, hjson, math
import pandas as pd
from bs4 import BeautifulSoup as bs

p = re.compile(r"jobmap\[\d+\]= ({.*?})")      # jobmap entries embedded in the page JavaScript
p1 = re.compile(r"var searchUID = '(.*?)';")   # tk token needed to build each job url
counter = 0
final = {}

with requests.Session() as s:
    r = s.get('https://www.indeed.com/q-software-developer-l-San-Francisco-jobs.html#')
    soup = bs(r.content, 'lxml')
    tk = p1.findall(r.text)[0]
    listings_per_page = 10
    number_of_listings = int(soup.select_one('[name=description]')['content'].split(' ')[0].replace(',', ''))
    #number_of_pages = math.ceil(number_of_listings/listings_per_page)
    number_of_pages = math.ceil(6175/listings_per_page)  # manually entered count
    for page in range(1, number_of_pages + 1):
        if page > 1:
            # offset is 10 * (page - 1): page 2 starts at 10, page 3 at 20, ...
            r = s.get('https://www.indeed.com/jobs?q=software+developer&l=San+Francisco&start={}'.format(10 * (page - 1)))
            tk = p1.findall(r.text)[0]

        for item in p.findall(r.text):
            data = hjson.loads(item)    # hjson tolerates the non-JSON object syntax
            jk = data['jk']
            row = {'title': data['title'],
                   'company': data['cmp'],
                   'url': f'https://www.indeed.com/viewjob?jk={jk}&tk={tk}&from=serp&vjs=3'}
            final[counter] = row
            counter += 1

df = pd.DataFrame(final)
output_df = df.T
output_df.to_csv(r'C:\Users\User\Desktop\Data.csv', sep=',', encoding='utf-8-sig', index=False)
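
The hjson import matters here: the jobmap entries in the page source are JavaScript object literals with unquoted keys, which json.loads would reject but hjson accepts. An illustrative, made-up entry (the real objects carry more fields):

import hjson

item = '{jk:"abc123def456",title:"Software Developer",cmp:"Acme Corp"}'  # hypothetical
data = hjson.loads(item)  # json.loads would choke on the unquoted keys
print(data['jk'], data['title'], data['cmp'])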

If you want to use selenium to get the correct initial listings count:

import requests, re, hjson, math
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
d = webdriver.Chrome(r'C:\Users\HarrisQ\Documents\chromedriver.exe', options=options)
d.get('https://www.indeed.com/q-software-developer-l-San-Francisco-jobs.html#')
number_of_listings = int(d.find_element_by_css_selector('[name=description]').get_attribute('content').split(' ')[0].replace(',', ''))
d.quit()

p = re.compile(r"jobmap\[\d+\]= ({.*?})")      # jobmap entries embedded in the page JavaScript
p1 = re.compile(r"var searchUID = '(.*?)';")   # tk token needed to build each job url
counter = 0
final = {}

with requests.Session() as s:
    r = s.get('https://www.indeed.com/q-software-developer-l-San-Francisco-jobs.html#')
    tk = p1.findall(r.text)[0]
    listings_per_page = 10
    number_of_pages = math.ceil(number_of_listings / listings_per_page)  # count taken from selenium above
    for page in range(1, number_of_pages + 1):
        if page > 1:
            # offset is 10 * (page - 1): page 2 starts at 10, page 3 at 20, ...
            r = s.get('https://www.indeed.com/jobs?q=software+developer&l=San+Francisco&start={}'.format(10 * (page - 1)))
            tk = p1.findall(r.text)[0]

        for item in p.findall(r.text):
            data = hjson.loads(item)    # hjson tolerates the non-JSON object syntax
            jk = data['jk']
            row = {'title': data['title'],
                   'company': data['cmp'],
                   'url': f'https://www.indeed.com/viewjob?jk={jk}&tk={tk}&from=serp&vjs=3'}
            final[counter] = row
            counter += 1

df = pd.DataFrame(final)
output_df = df.T
output_df.to_csv(r'C:\Users\User\Desktop\Data.csv', sep=',', encoding='utf-8-sig', index=False)
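
One caveat if you run this today: Selenium 4 removed find_element_by_css_selector and the positional driver-path argument, so the selenium portion would now be written along these lines (same chromedriver path as above, which you would adjust to your machine):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")
d = webdriver.Chrome(service=Service(r'C:\Users\HarrisQ\Documents\chromedriver.exe'),
                     options=options)
d.get('https://www.indeed.com/q-software-developer-l-San-Francisco-jobs.html#')
content = d.find_element(By.CSS_SELECTOR, '[name=description]').get_attribute('content')
number_of_listings = int(content.split(' ')[0].replace(',', ''))
d.quit()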
QHarr