-1

I have an HTML file that I download with curl from Python. However, I don't know how to get the data I want out of it. I've used BeautifulSoup to get values from XML files, but never from something like this. Here is the section of the file I'm trying to read:

<script>
var AC = {};
AC.org_json = 
{
    "id": "manager",
    "children": [
        {
            "id": "employee1",
            "children": [],
            "data": {
                "direct_reports": 0,
                "badge_color": "F",
                "badge_url": "https://someurl",
                "full_name": "Employee1 Name",
                "job_title": "Employee Job Title",
                "department_name": "IT",
                "building": "SITE1",
                "phone": null,
                "expanded": false
            }
        },
        {
            "id": "employee2",
            "children": [],
            "data": {
                "direct_reports": 0,
                "badge_color": "F",
                "badge_url": "https://someurl",
                "full_name": "Employee2 Name",
                "job_title": "Employee Job Title",
                "department_name": "IT",
                "building": "SITE1",
                "phone": null,
                "expanded": false
            }
        },
      ......continues for however many entries there are.
</script>

The goal is to grab the "id" and the "job_title" of each entry. I just need some help getting started in the right direction. Any help is appreciated. Thank you.

EDIT: I was able to separate the data in the script tags from the rest of the HTML file.

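from bs4 import BeautifulSoup

# Open the downloaded HTML file (html holds its path) and parse it
with open(html, 'r') as f:
    get_data = f.read()
soup = BeautifulSoup(get_data, 'html.parser')
# The script block is the first <script> after the div with id="content"
title = soup.find("div", id="content")
json_data = title.find_next("script")
print(json_data)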

and it gives me the exact output above. The next question, though, is how do I get the values out of that data? If I do:

data = json.loads(json_data)
print data

Then I get: ValueError: No JSON object could be decoded

dkeeper09

2 Answers

1

Here's what I would do:

  1. Use BeautifulSoup4 to parse the HTML file
  2. Run soup.find_all('script') to get all the script tags.
  3. Iterate over the list of script tags, extract their text, pass the text to json.loads(), and then get the values from the dictionary it returns.

If you know there's only the one script tag, #3 is pretty easy. If there's a chance there are other script blocks with lots of non-JSON JavaScript, you'll probably need to use some regex or else a try/except block, because json.loads() will raise an error if you pass it a string that's not JSON.
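The steps above can be sketched roughly as follows. This is only an illustration: html_doc is a made-up stand-in for the downloaded page, json_from_scripts and MARKER are hypothetical names, and it assumes the assignment appears in the page exactly as "AC.org_json =".

```python
import json
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, trimmed to two script tags
html_doc = """
<div id="content"></div>
<script>console.log("not JSON");</script>
<script>
var AC = {};
AC.org_json =
{"id": "manager", "children": [
  {"id": "employee1", "children": [], "data": {"job_title": "Employee Job Title"}}
]}
</script>
"""

MARKER = "AC.org_json ="  # assumption: the page spells the assignment this way


def json_from_scripts(html_text):
    """Return every script body (after MARKER, if present) that parses as JSON."""
    soup = BeautifulSoup(html_text, "html.parser")
    parsed = []
    for script in soup.find_all("script"):
        text = script.string  # tag contents as a string; None if there are none
        if not text:
            continue
        _, found, rest = text.partition(MARKER)
        candidate = rest if found else text
        try:
            parsed.append(json.loads(candidate))
        except ValueError:  # raised for anything that is not valid JSON
            continue
    return parsed


org = json_from_scripts(html_doc)[0]
print(org["id"])
```

The first script tag is skipped because json.loads() rejects it, so only the org chart ends up in the returned list.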

aglensmith
  • Yep that works great. So when I do a `json.loads()` on that I get this error: `TypeError: expected string or buffer` – dkeeper09 Apr 07 '18 at 03:29
  • json.loads takes a string as an argument, loads that string into a Python dictionary, and returns the dict. Make sure you are passing the contents of the script tag as a string and not the tag itself, which is a bs4 tag object. For example, do this: json.loads(script_tag.string). Don't do this: json.loads(script_tag). – aglensmith Apr 07 '18 at 04:22
  • One thing I just thought of: once you get the string, you'll need to extract just the JSON portion for json.loads. To do that, you could use regex, like other commenters mentioned. Another simple way would be to call split('=') on the script tag's string and take the last element, so something like: script_dict = json.loads(script_tag_string.split('=')[-1]) – aglensmith Apr 07 '18 at 04:35
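As a rough sketch of the split('=') idea, followed by pulling out the "id" and "job_title" the question asks for: script_text is a made-up sample shaped like the block in the question, and collect_titles is a hypothetical helper, not part of any library.

```python
import json

# Hypothetical script text, shaped like the block in the question
script_text = """
var AC = {};
AC.org_json =
{
    "id": "manager",
    "children": [
        {"id": "employee1", "children": [],
         "data": {"job_title": "Employee Job Title", "phone": null}},
        {"id": "employee2", "children": [],
         "data": {"job_title": "Employee Job Title", "phone": null}}
    ]
}
"""

# Everything after the last '=' is the JSON text. This breaks if any value
# (a URL with a query string, say) contains '='.
org = json.loads(script_text.split("=")[-1])


def collect_titles(node):
    """Walk the tree and return (id, job_title) pairs; job_title may be None."""
    data = node.get("data") or {}
    pairs = [(node["id"], data.get("job_title"))]
    for child in node.get("children", []):
        pairs.extend(collect_titles(child))
    return pairs


for emp_id, title in collect_titles(org):
    print("%s: %s" % (emp_id, title))
```

The walk is recursive because each entry carries its own "children" list, so deeper levels of the org chart would be picked up too.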
0

You are trying to parse a JavaScript object literal (JSON) embedded in another language (HTML). Ideally you'd load the HTML with a real parser and then the JavaScript with a real parser. But if you absolutely know your file format, you can hack up some regexes to remove everything except the JSON and then use json.loads() to parse it into a Python dictionary.
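A minimal sketch of the regex route, assuming the script text looks like the sample in the question and that nothing containing a '}' follows the object. script_text and pattern are illustrative names only.

```python
import json
import re

# Hypothetical script text, shaped like the block in the question
script_text = """
var AC = {};
AC.org_json =
{
    "id": "manager",
    "children": [
        {"id": "employee1", "children": [], "data": {"job_title": "Employee Job Title"}}
    ]
}
"""

# Capture from the first '{' after "AC.org_json =" through the last '}' in the
# text. Greedy on purpose; it relies on the assumption stated above.
pattern = re.compile(r"AC\.org_json\s*=\s*(\{.*\})", re.DOTALL)

match = pattern.search(script_text)
if match is None:
    raise ValueError("AC.org_json assignment not found")
org = json.loads(match.group(1))
print(org["children"][0]["data"]["job_title"])
```

If more JavaScript with braces follows the object in the real page, the greedy match would swallow it and json.loads() would fail, which is the kind of fragility that comes with the "know your file format" caveat.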

guidoism
  • So I'm guessing something like this is what I'm looking for: https://stackoverflow.com/questions/2835559/parsing-values-from-a-json-file. The next question is how do I get the full JSON out of the HTML, I guess, right? – dkeeper09 Apr 07 '18 at 02:01
  • Like I said, you might want to try hacking up some regex to remove everything except for the JSON. Take a look at https://docs.python.org/2/library/re.html since it sounds like you didn't quite understand my answer. But I will warn you that regexes are meant for regular languages, and HTML and JavaScript are definitely not regular in the parsing sense, so you might run into trouble. See https://en.wikipedia.org/wiki/Regular_language for why this is not easy. You might be able to use one of the HTML parsing libraries to get the script tag out, though. – guidoism Apr 07 '18 at 02:09
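One possible way to sidestep the regex trouble, offered only as a sketch and not something proposed in the thread: once the script text is in hand, the standard library's json.JSONDecoder.raw_decode() can parse one JSON value starting at a given index and report where it ended, so trailing JavaScript is simply ignored. The sample text and the "AC.org_json" marker lookup below are assumptions for illustration.

```python
import json

# Hypothetical script text with more JavaScript after the object
script_text = """
var AC = {};
AC.org_json =
{"id": "manager", "children": [{"id": "employee1", "children": []}]};
AC.render = function () { return {}; };
"""

marker = "AC.org_json"
# Find the first '{' that follows the marker; raw_decode needs the exact start
start = script_text.index("{", script_text.index(marker))
# Parse a single JSON value from that position; end is the index just past it
org, end = json.JSONDecoder().raw_decode(script_text, start)
print(org["id"])
```

Here the JSON parser itself decides where the object stops, so the braces in the later function definition do not matter.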