I have an HTML file that I download with curl and then read in Python. However, I don't know how to get the data I want out of it. I've used BeautifulSoup to get values from XML files, but never from something like this. Here is the section of the file I'm trying to read:
<script>
var AC = {};
AC.org_json =
{
"id": "manager",
"children": [
{
"id": "employee1",
"children": [],
"data": {
"direct_reports": 0,
"badge_color": "F",
"badge_url": "https://someurl",
"full_name": "Employee1 Name",
"job_title": "Employee Job Title",
"department_name": "IT",
"building": "SITE1",
"phone": null,
"expanded": false
}
},
{
"id": "employee2",
"children": [],
"data": {
"direct_reports": 0,
"badge_color": "F",
"badge_url": "https://someurl",
"full_name": "Employee2 Name",
"job_title": "Employee Job Title",
"department_name": "IT",
"building": "SITE1",
"phone": null,
"expanded": false
}
},
......continues for however many entries there are.
</script>
The goal is to grab the "id" and the "job_title" of each entry. I just need some help getting started in the right direction. Any help is appreciated. Thank you.
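As a rough sketch of that goal, assuming the object has already been decoded into a Python dict with the same nesting as above: a small recursive walk can collect each "id" together with the "job_title" from its "data". The function name `collect_titles` and the trimmed sample dict are made up for illustration (the real data has more fields and entries), and the code is written for Python 3.

```python
def collect_titles(node):
    """Return (id, job_title) pairs for node and all of its descendants."""
    # "data" may be absent on some nodes, so fall back to an empty dict.
    data = node.get("data") or {}
    pairs = [(node.get("id"), data.get("job_title"))]
    for child in node.get("children", []):
        pairs.extend(collect_titles(child))
    return pairs


# Hypothetical, trimmed-down sample with the same shape as the data above.
org = {
    "id": "manager",
    "children": [
        {"id": "employee1", "children": [],
         "data": {"full_name": "Employee1 Name", "job_title": "Employee Job Title"}},
        {"id": "employee2", "children": [],
         "data": {"full_name": "Employee2 Name", "job_title": "Employee Job Title"}},
    ],
}

for emp_id, job_title in collect_titles(org):
    print(emp_id, job_title)
```

Recursing into "children" means the same function should keep working if employees have direct reports of their own.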
EDIT: I was able to separate the contents of the script tag from the rest of the HTML file:
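from bs4 import BeautifulSoup

# Read the downloaded HTML file (html is the path to it)
with open(html, 'r') as f:
    get_data = f.read()
soup = BeautifulSoup(get_data, 'html.parser')
# The script block is the first <script> after the div with id="content"
title = soup.find("div", id="content")
json_data = title.find_next("script")
print(json_data)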
and it gives me exactly the output above. My next question is: how do I get the values out of that data? If I do:
data = json.loads(json_data)
print data
then I get: ValueError: No JSON object could be decoded
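A likely cause, judging from the output above, is that the script block is not pure JSON: it starts with JavaScript (`var AC = {};` and `AC.org_json =`), and `json.loads` only accepts a string holding a JSON document. The sketch below is one way around that, not the only one: locate the first `{` after the `AC.org_json` marker and let `json.JSONDecoder().raw_decode` parse a single object starting there, ignoring whatever follows. The `script_text` string is a shortened, hypothetical stand-in for the tag's text (with BeautifulSoup that would typically come from something like `json_data.string`), and `extract_json` is an illustrative helper name.

```python
import json

# Hypothetical, shortened stand-in for the text inside the <script> tag.
script_text = """
var AC = {};
AC.org_json =
{
"id": "manager",
"children": [
{
"id": "employee1",
"children": [],
"data": {
"job_title": "Employee Job Title",
"phone": null,
"expanded": false
}
}
]
};
"""


def extract_json(text, marker="AC.org_json"):
    """Decode the first JSON object that appears after marker in text."""
    # Search for "{" only from the marker onward, so the "{}" in
    # "var AC = {};" is skipped.
    start = text.index("{", text.index(marker))
    # raw_decode parses one JSON value beginning at start and returns
    # (value, end_index); trailing JavaScript such as ";" is ignored.
    obj, _end = json.JSONDecoder().raw_decode(text, start)
    return obj


org = extract_json(script_text)
for child in org["children"]:
    print(child["id"], child["data"]["job_title"])
```

Using `raw_decode` avoids having to trim the trailing `;` or any later statements by hand, though it does assume the object after the marker is valid JSON.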