Large nested json to pandas dataframe python

Question

I have a large nested json object that I'd like to read into a pandas dataframe. The json object has many nested json pairs inside of it. The basic layering is:

API Call metadata (I don't really care about this)
Survey response metadata (I'd like this information to be included in the final output)
Pages (survey pages with an id number, and what questions are on that page)
Questions (the question_ids, and related response_ids from the questions)

A sample looks like this:

{
    "per_page": 50,
    "total": 4,
    "data": [
        {
            "total_time": 5276,
            "href": "https://somelink.com",
            "custom_variables": {},
            "ip_address": "XXX.XXX.XX.XX",
            "id": "1111111",
            "logic_path": {},
            "date_modified": "2018-08-17T19:57:43+00:00",
            "response_status": "completed",
            "custom_value": "",
            "analyze_url": "https://somelink.com/respondent_id=1111111",
            "pages": [
                {
                    "id": "38638937",
                    "questions": []
                },
                {
                    "id": "38638938",
                    "questions": [
                        {
                            "id": "124810659",
                            "answers": [
                                {
                                    "text": "some_answer_text"
                                }
                            ]
                        },
                        {
                            "id": "124810660",
                            "answers": [
                                {
                                    "text": "some_other_answer_text"
                                }
                            ]
                        }
                    ]
                },
                {
                    "id": "38638944",
                    "questions": [
                        {
                            "id": "124810656",
                            "answers": [
                                {
                                    "col_id": "905794209",
                                    "choice_id": "905794459",
                                    "row_id": "905794204"
                                },
                                {
                                    "col_id": "905794210",
                                    "choice_id": "905794463",
                                    "row_id": "905794204"
                                },
                                {
                                    "col_id": "905794209",
                                    "choice_id": "905794459",
                                    "row_id": "905794205"
                                },
                                {
                                    "col_id": "905794210",
                                    "choice_id": "905794464",
                                    "row_id": "905794205"
                                }
                            ]
                        }
                    ]   
                }
                .
                .
                .
                .
                .
                .
                .
                .
            ],
            "page_path": [],
            "recipient_id": "4107168056",
            "collector_id": "216279750",
            "date_created": "2018-09-05T15:28:38+00:00",
            "survey_id": "222222222",
            "collection_mode": "default",
            "edit_url": "https://www.somelink.com/somerefnumber",
            "metadata": {
                "contact": {
                    "email": {
                        "type": "string",
                        "value": "name@somememail.com"
                    }
                }
            }
        },
        {
            "total_time": 6978,
            "href": "https://somelink.com",
            "custom_variables": {},
            "ip_address": "XXX.XXX.XX.XX",
            "id": "4444444",
            "logic_path": {},
            "date_modified": "2018-08-15T19:16:43+00:00",
            "response_status": "completed",
            "custom_value": "",
            "analyze_url": "https://somelink.com/respondent_id=4444444",
            "pages": [
                    .
                    .
                    .

            ]
        }

    ],
    "page": 1,
    "links": {
        "self": "https://api.somelink.com/22222222/responses/bulk?page=1&per_page=50"
    }
}

There can be any number of responses, pages, and questions within a page.

My question is: How can I get the above json into a pandas datafame that looks like this:

I've tried using json_normalize, but I believe I am making some mistakes.

import pandas as pd
import requests
from pandas.io.json import json_normalize

headers={"Authorization": "Bearer %s" % MY_ACCESS_TOKEN,
                           "Content-Type": "application/json"}

url = "https://api.surveymonkey.com/v3/surveys/%s/responses/bulk" % (my_survey_id)
response = requests.get(url, headers=headers)
responses = response.json()

pages_data = json_normalize(data=responses['data'], record_path='pages', meta=['response_status', 'recipient_id', 'collector_id', 'survey_id', 'date_created', 'date_modified', 'ip_address', 'href', 'total_time'])
print(pages_data.head(10))
pages_data.to_csv("output.csv")

As addintional info, this is from the SurveyMonkey API. SurveyMonkey lets you export survey results to csv via the web interface, but I'd like to use the API to re-create the standard response report, and eventually make custom reports/do other things.

I'm using python3.6. I'm assuming I need to do some pre-processing to flatten out the data, but I'm not sure how to do so. Any help appreciated!

if the json is always in the same format, could you take the values you need, put them in a list (or other format), then build a dataframe from there? — MattR, Oct 29 '18 at 20:23
@MattR, the JSON will be different depending on the survey. I'd like to programmatically flatten the JSON regardless of survey structure. If I did know the survey structure, I could try to extract the values manually into a list. Do you have code snippets that could help with that? — user2752159, Oct 30 '18 at 16:17
you can think of JSON as dictionaries, so anything that will flatten dictionaries (like maybe [this post](https://stackoverflow.com/questions/33440624/pythonic-way-to-flatten-a-dictionary-into-a-list-using-list-comprehension) will help you. — MattR, Oct 30 '18 at 16:31

score 0 · Answer 1 · answered Jun 23 '20 at 11:08

Here is a way to flatten the nested dictionary in pandas using glom. The aim is to extract selected keys and value from the nested dictionary and save them in a separate column of the pandas dataframe (:

Here is a step by step guide: https://medium.com/@enrico.alemani/flatten-nested-dictionaries-in-pandas-using-glom-7948345c88f5

import pandas as pd
from glom import glom
from ast import literal_eval


target = {
    "page": 10,
    "total": 100,
    "data":
        {
            "total_time": 100,
            "href": "https://mylink.com",
            "ip_address": "000.00.00.00",
            "id": "012345",
            "response_status": "completed",
        }
}

# Import data
df = pd.DataFrame([str(target)], columns=['target'])

# Extract id keys and save value into a separate pandas column
df['id'] = df['target'].apply(lambda row: glom(literal_eval(row), 'data.id'))

score -2 · Answer 2 · answered Oct 29 '18 at 21:42

-2

Convert your json into dataframe:

df = pd.DataFrame(json)

after:

df = df.to_csv(file, sep='\t')

answered Oct 29 '18 at 21:42

Jose Salas

7
2

1

This doesn't seem to quite hit what OP is looking for. – d_kennetz Oct 29 '18 at 21:45

Large nested json to pandas dataframe python

2 Answers2