I want to extract multiple independently nested JSON objects, and specific keys from them, from a web page. By "independently nested," I mean that each JSON object is nested within its own script type = "application/ld+json" element.
I am currently using BeautifulSoup, json, and requests to try to accomplish this task, but I can't get it to work. I have read through similar posts (e.g., here, here, and here), but none of them address this issue: how to extract multiple independently nested JSON objects at once and then extract specific keys from those objects. The other examples assume the JSON objects are all within one nest.
Here is a working example of where I am currently at:
# Using Python 3.8.1, 32 bit, Windows 10
from bs4 import BeautifulSoup
import requests
import json
#%% Create variable with website location
reno = 'https://www.foodpantries.org/ci/nv-reno'
#%% Download the webpage
renoContent = requests.get(reno)
#%% Make into nested html
renoHtml = BeautifulSoup(renoContent.text, 'html.parser')
#%% Keep only the HTML that contains the JSON objects I want
spanList = renoHtml.find("div", class_="span8")
#%% Get JSON objects.
data = json.loads(spanList.find('script', type='application/ld+json').text)
print(data)
This is where I am stuck. I can get the JSON data for the first location, but I can't get it for the other 9 locations listed in the spanList variable. How can I have Python get me the JSON data from the other 9 locations? I did try spanList.find_all, but that raises an AttributeError: ResultSet object has no attribute 'text'. And if I remove .text from the json.loads call, I get TypeError: the JSON object must be str, bytes or bytearray, not ResultSet.
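For reference, here is a minimal sketch of what looping over the find_all result might look like, since a ResultSet is a list of tags rather than a single tag. It uses a small made-up HTML string (SAMPLE_HTML) in place of the live page, and the helper name load_all_ld_json is hypothetical; it is untested against the real site.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: two separate ld+json scripts.
SAMPLE_HTML = """
<div class="span8">
<script type="application/ld+json">{"name": "Pantry A"}</script>
<script type="application/ld+json">{"name": "Pantry B"}</script>
</div>
"""

def load_all_ld_json(html):
    """Parse every application/ld+json script in the html into a Python object."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    # find_all returns a ResultSet (a list of tags), so handle each tag in turn.
    for script in soup.find_all("script", type="application/ld+json"):
        if script.string:  # skip empty script elements
            records.append(json.loads(script.string))
    return records

records = load_all_ld_json(SAMPLE_HTML)
print(len(records))
```

Because the loop runs once per matching script element, it should not matter whether a page lists 10 locations or 20.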
My hunch is that this is complicated because each JSON object is wrapped in its own script type = "application/ld+json" element. None of the other examples I saw had a similar situation. It seems json.loads is only recognizing that first JSON object and then stopping.
The other complication is that the number of locations changes based on the city. I am hoping there is a solution that will automatically pull all the locations no matter how many are on the page (e.g., Reno has 10 but Las Vegas has 20).
I also couldn't figure out how to extract values from the loaded JSON using key names such as name and streetAddress. This could be because of how I am extracting the JSON object via json.loads, but I am unsure.
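To make the key question concrete, here is a small sketch of the kind of access I mean, assuming json.loads returns an ordinary nested dict. The raw string is a shortened, hand-copied version of one object from the page, not the actual download.

```python
import json

# Shortened copy of one JSON object (an assumption, not the live page content).
raw = """
{
  "@type": "LocalBusiness",
  "address": {
    "@type": "PostalAddress",
    "streetAddress": "2301 Kings Row",
    "postalCode": "89503"
  },
  "name": "Desert Springs Baptist Church"
}
"""

data = json.loads(raw)

# Top-level keys are read directly from the dict.
name = data["name"]

# streetAddress lives one level down, inside the "address" object.
street = data["address"]["streetAddress"]

print(name, street)
```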
Here is an example of how each JSON object is laid out:
<script type = "application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "LocalBusiness",
  "address": {
    "@type": "PostalAddress",
    "streetAddress": "2301 Kings Row",
    "addressLocality": "Reno",
    "addressRegion": "NV",
    "postalCode": "89503"
  },
  "name": "Desert Springs Baptist Church",
  "image": "https://www.foodpantries.org/gallery/28591_desert_springs_baptist_church_89503_wzb.jpg",
  "description": "Provides a food pantry. Must provide ID and be willing to fill out intake form Pantry.Hours: Friday 11:00am - 12:00pmFor more information, please call. ",
  "telephone": "(775) 746-0692"
}
</script>
My ultimate goal is to export the data contained within the keys name, streetAddress, addressLocality, addressRegion, and postalCode to a CSV file.
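A rough sketch of the output I am after, assuming the objects have already been parsed into a list of dicts. The records list is hand-written sample data, the helper names to_row and write_csv are hypothetical, and an in-memory buffer stands in for the real CSV file.

```python
import csv
import io

FIELDS = ["name", "streetAddress", "addressLocality", "addressRegion", "postalCode"]

# Hand-written sample in the same shape as the JSON objects on the page.
records = [
    {
        "name": "Desert Springs Baptist Church",
        "address": {
            "streetAddress": "2301 Kings Row",
            "addressLocality": "Reno",
            "addressRegion": "NV",
            "postalCode": "89503",
        },
    }
]

def to_row(record):
    """Flatten one record: name from the top level, the rest from 'address'."""
    address = record.get("address", {})
    row = {"name": record.get("name", "")}
    for field in FIELDS[1:]:
        row[field] = address.get(field, "")
    return row

def write_csv(records, fileobj):
    """Write a header line plus one line per record to an open text file object."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    for record in records:
        writer.writerow(to_row(record))

# StringIO stands in for a file opened with open(path, "w", newline="").
buffer = io.StringIO()
write_csv(records, buffer)
csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```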