
I can clearly see the tag I need in order to get the data I want to scrape.

According to multiple tutorials, I am doing it exactly the same way.

So why does it give me `None` when I simply want to display the content of the `li` tag with class `list-item`?

from bs4 import BeautifulSoup
import requests

response = requests.get("https://www.governmentjobs.com/careers/sdcounty")
soup = BeautifulSoup(response.text, 'html.parser')

job = soup.find('li', attrs={'class': 'list-item'})
print(job)


Serdia
  • The short answer as a comment: You can only get the html page through that link, but unfortunately, the content is inserted into the page dynamically through JavaScript. Which means the page you get doesn't even contain those elements. – Sraw Nov 09 '19 at 22:28
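Sraw's point can be demonstrated offline with a minimal sketch. The markup below is an invented stand-in for what the server actually returns, not the real page source:

```python
from bs4 import BeautifulSoup

# Stand-in for the raw HTML the server sends: the <li> items are not
# present; a script is supposed to insert them later, in the browser.
html = """
<html><body>
  <ul id="jobs"></ul>
  <script>
    document.getElementById('jobs').innerHTML =
      '<li class="list-item">Job posting</li>';
  </script>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')
# BeautifulSoup only parses; it never executes the script, so the lookup
# that works in the browser's element inspector finds nothing here.
print(soup.find('li', attrs={'class': 'list-item'}))  # None
```

The text `list-item` does occur in the document, but only inside the script's string literal, which the parser treats as raw script data rather than markup.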

3 Answers


Whilst the page does update dynamically (the browser makes additional requests to update the content, and you don't capture those with your single request), you can find the source URI for the content of interest in the network tab. You also need to add the expected header.

import requests
from bs4 import BeautifulSoup as bs

# identify the request as AJAX, as the browser does for this endpoint
headers = {'X-Requested-With': 'XMLHttpRequest'}
r = requests.get('https://www.governmentjobs.com/careers/home/index?agency=sdcounty&sort=PositionTitle&isDescendingSort=false&_=', headers=headers)
soup = bs(r.content, 'lxml')
print(len(soup.select('.list-item')))  # number of job postings found
QHarr
  • Thanks @QHarr. So if I do those tricks as you did, I can simply proceed parsing by inspecting HTML code from this link: `https://www.governmentjobs.com/careers/sdcounty/index` ?? – Serdia Nov 09 '19 at 22:43
  • you have all that you need in the soup object. You can write that to an editor to inspect if you want. – QHarr Nov 09 '19 at 22:58
  • You would probably want to tidy up your parsing, perhaps with a loop, e.g. `for i in soup.select('.list-item'): # do something` – QHarr Nov 09 '19 at 22:59
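QHarr's looping suggestion might look like the sketch below. The markup here is a made-up stand-in for the AJAX response body; the real structure inside each list item may differ:

```python
from bs4 import BeautifulSoup

# Invented sample standing in for the AJAX response body
html = """
<ul>
  <li class="list-item"><a href="/careers/sdcounty/jobs/1">Analyst</a></li>
  <li class="list-item"><a href="/careers/sdcounty/jobs/2">Engineer</a></li>
</ul>
"""

soup = BeautifulSoup(html, 'html.parser')
for item in soup.select('.list-item'):
    # pull the title and link out of each search result
    print(item.get_text(strip=True), item.a['href'])
```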

There is no such content in the original page. The search results you're referring to are loaded dynamically/asynchronously using JavaScript.

Print the variable `response.text` to verify that. I got the same result using ReqBin; you'll find that there's no text `list-item` inside.

Unfortunately, you can't run JavaScript with BeautifulSoup.
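The check described above can be sketched like this, using a stand-in string in place of `response.text` (the real check needs the actual network response):

```python
# Stand-in for response.text: the raw page references a script but
# contains none of the job markup itself.
raw_html = "<html><body><ul id='jobs'></ul><script src='/app.js'></script></body></html>"

# If this prints False, the content you see in the browser was added
# by JavaScript after the page loaded.
print('list-item' in raw_html)
```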

Thomas Weller
  • Sorry, I am new to scraping. What do you mean no content on original page? this would be original page: `https://www.governmentjobs.com`. But I don't need original page, I need `https://www.governmentjobs.com/careers/sdcounty` page, which is for sure has that content because I can see it. – Serdia Nov 09 '19 at 22:23
  • @Oleg: put `https://www.governmentjobs.com/careers/sdcounty` as URL on the page [ReqBin](https://reqbin.com/) and you'll find that it is *not* on the original page. You can see it in the browser, because it is done by JavaScript at a later point in time, not noticeable for humans. – Thomas Weller Nov 09 '19 at 22:44
  • Thanks. Learned something new, for sure! – Serdia Nov 10 '19 at 19:59

Another way to handle dynamically loaded data is to use Selenium instead of requests to get the page source. Since Selenium drives a real browser, it should wait for the JavaScript to load the data and then give you the corresponding HTML. This can be done like so:

from bs4 import BeautifulSoup
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options

url = "<URL>"

chrome_options = Options()
chrome_options.add_argument("--headless")  # runs the browser in the background

with Chrome(options=chrome_options) as browser:
    browser.get(url)
    html = browser.page_source

soup = BeautifulSoup(html, 'html.parser')
job = soup.find('li', attrs = {'class':'list-item'})
print(job)
Max Kaha