
I can clearly see the tag I need in order to get the data I want to scrape.

According to multiple tutorials, I am doing it exactly the same way.

So why does it give me `None` when I simply want to display the content of the `li` tag with class `list-item`?

from bs4 import BeautifulSoup
import requests

response = requests.get("https://www.governmentjobs.com/careers/sdcounty")
soup = BeautifulSoup(response.text, 'html.parser')

job = soup.find('li', attrs={'class': 'list-item'})
print(job)


Serdia
  • The short answer as a comment: You can only get the html page through that link, but unfortunately, the content is inserted into the page dynamically through JavaScript. Which means the page you get doesn't even contain those elements. – Sraw Nov 09 '19 at 22:28
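Sraw's point can be demonstrated offline with a minimal sketch. The markup below is an invented stand-in for what the server actually returns, not the real page source:

```python
from bs4 import BeautifulSoup

# Stand-in for the raw HTML the server sends: the <li> items are not
# present; a script is supposed to insert them later, in the browser.
html = """
<html><body>
  <ul id="jobs"></ul>
  <script>
    document.getElementById('jobs').innerHTML =
      '<li class="list-item">Job posting</li>';
  </script>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')
# BeautifulSoup only parses; it never executes the script, so the lookup
# that works in the browser's element inspector finds nothing here.
print(soup.find('li', attrs={'class': 'list-item'}))  # None
```

The text `list-item` does occur in the document, but only inside the script's string literal, which the parser treats as raw script data rather than markup.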

3 Answers


Whilst the page does update dynamically (the browser makes additional requests to update the content, and you don't capture those with your single request), you can find the source URI for the content of interest in the network tab. You also need to add the expected header.

import requests
from bs4 import BeautifulSoup as bs

# identify the request as AJAX, as the browser does for this endpoint
headers = {'X-Requested-With': 'XMLHttpRequest'}
r = requests.get('https://www.governmentjobs.com/careers/home/index?agency=sdcounty&sort=PositionTitle&isDescendingSort=false&_=', headers=headers)
soup = bs(r.content, 'lxml')
print(len(soup.select('.list-item')))  # number of job postings found
QHarr
  • Thanks @QHarr. So if I do those tricks as you did, I can simply proceed parsing by inspecting HTML code from this link: `https://www.governmentjobs.com/careers/sdcounty/index` ?? – Serdia Nov 09 '19 at 22:43
  • you have all that you need in the soup object. You can write that to an editor to inspect if you want. – QHarr Nov 09 '19 at 22:58
  • You would probably want to tidy up your parsing, perhaps with a loop, e.g. `for i in soup.select('.list-item'): # do something` – QHarr Nov 09 '19 at 22:59
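QHarr's looping suggestion might look like the sketch below. The markup here is a made-up stand-in for the AJAX response body; the real structure inside each list item may differ:

```python
from bs4 import BeautifulSoup

# Invented sample standing in for the AJAX response body
html = """
<ul>
  <li class="list-item"><a href="/careers/sdcounty/jobs/1">Analyst</a></li>
  <li class="list-item"><a href="/careers/sdcounty/jobs/2">Engineer</a></li>
</ul>
"""

soup = BeautifulSoup(html, 'html.parser')
for item in soup.select('.list-item'):
    # pull the title and link out of each search result
    print(item.get_text(strip=True), item.a['href'])
```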

There is no such content in the original page. The search results you're referring to are loaded dynamically/asynchronously using JavaScript.

Print the variable `response.text` to verify that. I got the same result using ReqBin; you'll find that there's no text `list-item` inside.

Unfortunately, you can't run JavaScript with BeautifulSoup.
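The check described above can be sketched like this, using a stand-in string in place of `response.text` (the real check needs the actual network response):

```python
# Stand-in for response.text: the raw page references a script but
# contains none of the job markup itself.
raw_html = "<html><body><ul id='jobs'></ul><script src='/app.js'></script></body></html>"

# If this prints False, the content you see in the browser was added
# by JavaScript after the page loaded.
print('list-item' in raw_html)
```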

Thomas Weller
  • Sorry, I am new to scraping. What do you mean no content on original page? this would be original page: `https://www.governmentjobs.com`. But I don't need original page, I need `https://www.governmentjobs.com/careers/sdcounty` page, which is for sure has that content because I can see it. – Serdia Nov 09 '19 at 22:23
  • @Oleg: put `https://www.governmentjobs.com/careers/sdcounty` as URL on the page [ReqBin](https://reqbin.com/) and you'll find that it is *not* on the original page. You can see it in the browser, because it is done by JavaScript at a later point in time, not noticeable for humans. – Thomas Weller Nov 09 '19 at 22:44
  • Thanks. Learned something new, for sure! – Serdia Nov 10 '19 at 19:59

Another way to handle dynamically loaded data is to use Selenium instead of requests to get the page source. Since Selenium drives a real browser, it should wait for the JavaScript to load the data and then give you the corresponding HTML. This can be done like so:

from bs4 import BeautifulSoup
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options

url = "<URL>"

chrome_options = Options()
chrome_options.add_argument("--headless")  # runs the browser in the background

with Chrome(options=chrome_options) as browser:
    browser.get(url)
    html = browser.page_source

soup = BeautifulSoup(html, 'html.parser')
job = soup.find('li', attrs = {'class':'list-item'})
print(job)
Max Kaha