How to extract objects from nested lists from a Json file with Python?

Question

I have a response that I receive from Lobbyview in the form of json. I tried to put it in data frame to access only some variables, but with no success. How can I access only some variables such as the id and the committees in a format exportable to .dta ? Here is the code I have tried.

import requests, json
query = {"naics": "424430"}
results = requests.post('https://www.lobbyview.org/public/api/reports',
data = json.dumps(query))
print(results.json())

import pandas as pd
b = pd.DataFrame(results.json())

_id = data["_id"]
committee = data["_source"]["specific_issues"][0]["bills_by_algo"][0]["committees"]

An observation of the json looks like this:

"_score": 4.421936, 
"_type": "object", 
"_id": "5EZUMbQp3hGKH8Uq2Vxuke", 
"_source": 
    {
    "issue_codes": ["CPT"], 
    "received": 1214320148, 
    "client_name": "INTELLECTUAL PROPERTY OWNERS ASSOCIATION", 
    "amount": 240000, 
    "client": 
        {
        "legal_name": "INTELLECTUAL PROPERTY OWNERS ASSOCIATION", 
        "name": "INTELLECTUAL PROPERTY OWNERS ASSOCIATION", 
        "naics": null, 
        "gvkey": null, 
        "ticker": "Unlisted", 
        "id": null, 
        "bvdid": "US131283992L"}, 
    "specific_issues": [
        {
        "text": "H.R. 34, H.R. 1908, H.R. 2336, H.R. 3093  S. 522, S. 681, S. 1145, S. 1745", 
        "bills_by_algo": [
            {
            "titles": ["To amend title 35, United States Code, to provide for patent reform.", "Patent Reform Act of 2007", "Patent Reform Act of 2007", "Patent Reform Act of 2007"], 
            "top_terms": ["Commerce", "Administrative fees"], 
            "sponsor": 
                {
                "firstname": "Howard", 
                "district": 28, 
                "title": "rep", 
                "id": 400025
                }, 
            "committees": ["House Judiciary"], 
            "introduced": 1176868800, 
            "type": "HR", "id": "110_HR1908"}, 
            {
            "titles": ["To amend title 35, United States Code, relating to the funding of the United States Patent and Trademark Office."], 
            "top_terms": ["Commerce", "Administrative fees"], 
            "sponsor": 
                {
                "firstname": "Howard", 
                "district": 28, 
                "title": "rep", 
                "id": 400025
                }, 
            "committees": ["House Judiciary"], 
            "introduced": 1179288000, 
            "type": "HR", 
            "id": "110_HR2336"
        }],

        "gov_entities": ["U.S. House of Representatives", "Patent and Trademark Office (USPTO)", "U.S. Senate", "UNDETERMINED", "U.S. Trade Representative (USTR)"], 
        "lobbyists": ["Valente, Thomas Silvio", "Wamsley, Herbert C"], 
        "year": 2007, 
        "issue": "CPT", 
        "id": "S4nijtRn9Q5NACAmbqFjvZ"}], 
    "year": 2007, 
    "is_latest_amendment": true,
     "type": "MID-YEAR AMENDMENT", 
    "id": "1466CDCD-BA3D-41CE-B7A1-F9566573611A", 
    "alternate_name": "INTELLECTUAL PROPERTY OWNERS ASSOCIATION"
    }, 
"_index": "collapsed"}```

Markus Rosenfelder · Accepted Answer · 2020-02-01T00:37:06.597

Since the data that you specified is nested pretty deeply in the JSON-response, you have to loop through it and save it to a list temporarily. To understand the response data better, I would advice you to use some tool to look into the JSON structure, like this online JSON-Viewer. Not every entry in the JSON contains the necessary data, therefore I try to catch the error through a try and except. To make sure that the id and committees are matched correctly, I chose to add them as small dicts to the list. This list can then be read into Pandas with ease. Saving to .dta requires you to convert the lists inside the committees column to strings, instead you might also want to save as .csv for a more generally usable format.

import requests, json
import pandas as pd


query = {"naics": "424430"}
results = requests.post(
    "https://www.lobbyview.org/public/api/reports", data=json.dumps(query)
)


json_response = results.json()["result"]

# to save the JSON response
# with open("data.json", "w") as outfile:
#     json.dump(results.json()["result"], outfile)

resulting_data = []

# loop through the response
for data in json_response:
    # try to find entries with specific issues, bills_by_algo and committees
    try:
        # loop through the special issues
        for special_issue in data["specific_issues"]:
            _id = special_issue["id"]
            # loop through the bills_by_algo's
            for x in special_issue["bills_by_algo"]:
                # append the id and committees in a dict
                resulting_data.append(({"id": _id, "committees": x["committees"]}))

    except KeyError as e:
        print(e, "not found in entry.")
        continue


# create a DataFrame
df = pd.DataFrame(resulting_data)
# export of list objects in the column is not supported by .dta, therefore we convert
# to strings with ";" as delimiter
df["committees"] = ["; ".join(map(str, l)) for l in df["committees"]]
print(df)
df.to_stata("result.dta")

Results in

                         id                                         committees
0    D8BxG5664FFb8AVc6KTphJ                                    House Judiciary
1    D8BxG5664FFb8AVc6KTphJ                                   Senate Judiciary
2    8XQE5wu3mU7qvVPDpUWaGP                                  House Agriculture
3    8XQE5wu3mU7qvVPDpUWaGP        Senate Agriculture, Nutrition, and Forestry
4    kzZRLAHdMK4YCUQtQAdCPY                                  House Agriculture
..                      ...                                                ...
406  ZxXooeLGVAKec9W2i32hL5                                  House Agriculture
407  ZxXooeLGVAKec9W2i32hL5  Senate Agriculture, Nutrition, and Forestry; H...
408  ZxXooeLGVAKec9W2i32hL5        House Appropriations; Senate Appropriations
409  ahmmafKLfRP8wZay9o8GRf                                  House Agriculture
410  ahmmafKLfRP8wZay9o8GRf        Senate Agriculture, Nutrition, and Forestry

[411 rows x 2 columns]

Thank you very much Markus for your very precise answer. Actually I was looking at the data and I noticed this way the "id" that is selected is inside the path special_issues>bills_by_algo. I was actually trying to extract one that is outside the path and defines the observation ("_id" not "id"). Is it possible to extract that value instead? — ocut, Jan 31 '20 at 23:13
I have not found any `_id` in the JSON file, I have changed to code now to use the `id` from the special issue. For this, you have to save it in the for loop before the `for x in special_issue["bills_by_algo"]:`. — Markus Rosenfelder, Feb 01 '20 at 00:38
Thank you Markus. When we turn the file into stata, is there a way to put each committes value in a different variable ? With other requests some of the committees values exceed 244 characters and are hence not supported by stata. — ocut, Feb 01 '20 at 22:07
Yes, you can convert your lists of strings to dummy columns, as you can read up in this question: https://stackoverflow.com/questions/29034928/pandas-convert-a-column-of-list-to-dummies — Markus Rosenfelder, Feb 02 '20 at 01:10

How to extract objects from nested lists from a Json file with Python?

1 Answers1