
The code below is what I have so far, but it only pulls data for the first 25 items, i.e. the ones shown on the page before scrolling down loads more:

import requests
from bs4 import BeautifulSoup
import time
import pandas as pd

start_time = time.time()
s = requests.Session()

#Get URL and extract content
response = s.get('https://www.linkedin.com/jobs/search?keywords=It%20Business%20Analyst&location=Boston%2C%20Massachusetts%2C%20United%20States&geoId=102380872&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum=0')
soup = BeautifulSoup(response.text, 'html.parser')

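# Locate the results list and collect each job's title, company, location, and link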
items = soup.find('ul', {'class': 'jobs-search__results-list'})
job_titles = [i.text.strip('\n ') for i in items.find_all('h3', {'class': 'base-search-card__title'})]
job_companies = [i.text.strip('\n ') for i in items.find_all('h4', {'class': 'base-search-card__subtitle'})]
job_locations = [i.text.strip('\n ') for i in items.find_all('span', {'class': 'job-search-card__location'})]
job_links = [i["href"].strip('\n ') for i in items.find_all('a', {'class': 'base-card__full-link'})]

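# Count how often each title, company, and location appears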
a = pd.DataFrame({'Job Titles': job_titles})
b = pd.DataFrame({'Job Companies': job_companies})
c = pd.DataFrame({'Job Locations': job_locations})

value_counts1 = a['Job Titles'].value_counts()
value_counts2 = b['Job Companies'].value_counts()
value_counts3 = c['Job Locations'].value_counts()

l1 = [f"{key} - {value_counts1[key]}" for key in value_counts1.keys()]
l2 = [f"{key} - {value_counts2[key]}" for key in value_counts2.keys()]
l3 = [f"{key} - {value_counts3[key]}" for key in value_counts3.keys()]

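# Assemble the three summaries as rows, then transpose so each becomes a column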
data = l1, l2, l3
df = pd.DataFrame(
    data, index=['Job Titles', 'Job Companies', 'Job Locations'])

df = df.T

print(df)
print("--- %s seconds ---" % (time.time() - start_time))

I would like to pull data for more than the first 25 items. Is there an efficient way to do this?

2 Answers


Inspect the page to find the container that holds the desired data; then you can scrape the infinite-scroll page with the Selenium WebDriver, scrolling it via window.scrollTo().

Check these for more:

crawl site that has infinite scrolling using python

or web-scraping-infinite-scrolling-with-selenium
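As a rough sketch of the idea (assuming Chrome with chromedriver on your PATH, and the jobs-list markup from the question), it could look like this:

import time
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('your_url')  # e.g. the LinkedIn search URL from the question

# Scroll to the bottom a few times so more results load;
# tune the count and the pause for your connection
for _ in range(10):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)

# Hand the fully loaded DOM to BeautifulSoup and grab the container
soup = BeautifulSoup(driver.page_source, 'html.parser')
items = soup.find('ul', {'class': 'jobs-search__results-list'})

driver.quit()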

– nabroleonx

The best way is to create a function to scroll down:

# Scroll function
# This function takes two arguments: the driver being used and a timeout.
# The driver is used to scroll, and the timeout is how long to wait for the page to load.

def scroll(driver, timeout):
    scroll_pause_time = timeout

    # Get scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")

    while True:
        # Scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait to load page
        time.sleep(scroll_pause_time)

        # Calculate new scroll height and compare with last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # If heights are the same it will exit the function
            break
        last_height = new_height

Then you can use the scroll function to scroll the desired page:

import time
import pandas as pd
from seleniumwire import webdriver  


# Create a new instance of the Firefox driver
driver = webdriver.Firefox()

# move to some url
driver.get('your_url')


# use "scroll" function to scroll the page every 5 seconds
scroll(driver, 5)
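
Once scroll() returns, the page's DOM contains all the results that were loaded, so you can hand driver.page_source to the same BeautifulSoup parsing used in the question (a sketch, assuming the question's markup; the other fields work the same way):

from bs4 import BeautifulSoup

# Parse the fully loaded page with the selectors from the question
soup = BeautifulSoup(driver.page_source, 'html.parser')
items = soup.find('ul', {'class': 'jobs-search__results-list'})
job_titles = [i.text.strip('\n ') for i in items.find_all('h3', {'class': 'base-search-card__title'})]

driver.quit()  # close the browser when finished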
– BlackMath