Where does data not in a website's source code come from and how do I get it using BeautifulSoup?

Question

I am trying to pull data from a local government's website using BeautifulSoup with Python, but the source code that it pulls down lacks the info I want. I know how to use BeautifulSoup, and I can pull down any part of the source code and use it in Python, but the data I want is not there. The page has all of the tags laid out with their appropriate ids, yet they contain no values. I see the same thing every time I view the page source in Chrome. When I inspect the page instead, the data is filled in where you would expect it to be in the rendered page. Some of the cells that are blank in the source but populated in the inspector do not have an id on the <td> tag; they are plain, untouched <td> elements.

I know the website pulls the data from a database because I know someone who helped create the database that it pulls the data from. I have talked to them, and they do not know how to get it. As the title says, how is the data being entered, and how do I access it?

If the data is being loaded after the page loads, you'll need to use something like Selenium to execute the JavaScript that does the request. — Carcigenicate, Dec 06 '19 at 02:08
@Carcigenicate Do you happen to know if there is a canonical "you need to use Selenium" wiki dupe for this? — ggorlen, Dec 06 '19 at 02:20
@ggorlen No. I was on the bus, and switching between the app and browser is a pain. I can look now. There must be one though. This gets asked like daily. — Carcigenicate, Dec 06 '19 at 02:23
@Carcigenicate Sorry for pinging you late in an old post, but did you ever find anything regarding a canonical answer? After seeing a brand new post on the topic, I was able to dig up in a matter of minutes 11 other questions on the exact same topic. There is a popular [question](https://stackoverflow.com/questions/8049520/web-scraping-javascript-page-with-python) which seems like it could be the one but, crucially, people will only really find it if they **already know** what the issue is. If there is none, I was thinking of creating one myself, maybe include a basic web scraping workflow. — AMC, Jan 05 '20 at 02:28
In addition to the discoverability issue, it would feel a bit bizarre to flag posts as duplicate, since the "original" question is more so the answer to the new post (geez that's messy, I hope I explained it clearly enough). There is [another post](https://stackoverflow.com/q/44867425/11301900) which would be more appropriate, but it's just an average question, unpopular and unpolished. Is this the sort of thing I should bring up in the Python chat? — AMC, Jan 05 '20 at 02:28

score 0 · Answer 1 · answered Dec 06 '19 at 02:39

Like the others have stated, you cannot see the data because it is being generated by JavaScript. To work around this, you will need to use something like Selenium or Splash to render the JavaScript first.
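To confirm that this is what is happening, you can check whether the cells are empty in the raw HTML before reaching for a browser. The following is a minimal sketch using only the standard library; the two HTML strings are made-up stand-ins for what "View Source" and the inspector might show, and `CellCollector` and `empty_cells` are hypothetical helper names, not part of any library.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td>, keyed by its id (or its position if it has none)."""

    def __init__(self):
        super().__init__()
        self.cells = {}       # key -> text found inside that cell
        self._current = None  # key of the <td> we are currently inside, if any
        self._index = 0       # running position of the <td> in the document

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            key = dict(attrs).get("id") or f"td[{self._index}]"
            self._index += 1
            self._current = key
            self.cells[key] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.cells[self._current] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._current = None


def empty_cells(html):
    """Return the keys of all <td> cells that contain no text."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return [key for key, text in parser.cells.items() if not text]


# Assumed example markup: the server sends the table skeleton only...
raw_html = '<table><tr><td id="owner"></td><td></td></tr></table>'
# ...and JavaScript fills the cells in afterwards.
rendered_html = '<table><tr><td id="owner">Jane Doe</td><td>42 Main St</td></tr></table>'

print(empty_cells(raw_html))       # ['owner', 'td[1]']
print(empty_cells(rendered_html))  # []
```

If the cells you care about show up as empty in the HTML you downloaded but not in the inspector, the values are most likely being filled in by a script after the page loads.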

I will provide an example using Selenium, since it is the more beginner-friendly of the two. Here are some useful resources to get started.

https://pythonspot.com/selenium-get-source/

https://selenium-python.readthedocs.io/installation.html

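from selenium import webdriver
from bs4 import BeautifulSoup

URL = "your url here"

options = webdriver.ChromeOptions()
options.add_argument("--ignore-certificate-errors")
options.add_argument("--test-type")
# Path to the browser binary; adjust this for your system
options.binary_location = "/usr/bin/chromium"
# Newer Selenium releases expect the `options=` keyword; `chrome_options=` is deprecated
driver = webdriver.Chrome(options=options)

# Load the page in a real browser so that its JavaScript runs
driver.get(URL)

# page_source reflects the page as the browser currently has it, not the raw HTML the server sent
html = driver.page_source
driver.quit()

# BeautifulSoup takes the markup first, then the parser name
soup = BeautifulSoup(html, "html.parser")

# Do your desired parsing with `soup` here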
