0

I have a large nested json object that I'd like to read into a pandas dataframe. The json object has many nested json pairs inside of it. The basic layering is:

  • API Call metadata (I don't really care about this)
  • Survey response metadata (I'd like this information to be included in the final output)
  • Pages (survey pages with an id number, and what questions are on that page)
  • Questions (the question_ids, and related response_ids from the questions)

A sample looks like this:

{
    "per_page": 50,
    "total": 4,
    "data": [
        {
            "total_time": 5276,
            "href": "https://somelink.com",
            "custom_variables": {},
            "ip_address": "XXX.XXX.XX.XX",
            "id": "1111111",
            "logic_path": {},
            "date_modified": "2018-08-17T19:57:43+00:00",
            "response_status": "completed",
            "custom_value": "",
            "analyze_url": "https://somelink.com/respondent_id=1111111",
            "pages": [
                {
                    "id": "38638937",
                    "questions": []
                },
                {
                    "id": "38638938",
                    "questions": [
                        {
                            "id": "124810659",
                            "answers": [
                                {
                                    "text": "some_answer_text"
                                }
                            ]
                        },
                        {
                            "id": "124810660",
                            "answers": [
                                {
                                    "text": "some_other_answer_text"
                                }
                            ]
                        }
                    ]
                },
                {
                    "id": "38638944",
                    "questions": [
                        {
                            "id": "124810656",
                            "answers": [
                                {
                                    "col_id": "905794209",
                                    "choice_id": "905794459",
                                    "row_id": "905794204"
                                },
                                {
                                    "col_id": "905794210",
                                    "choice_id": "905794463",
                                    "row_id": "905794204"
                                },
                                {
                                    "col_id": "905794209",
                                    "choice_id": "905794459",
                                    "row_id": "905794205"
                                },
                                {
                                    "col_id": "905794210",
                                    "choice_id": "905794464",
                                    "row_id": "905794205"
                                }
                            ]
                        }
                    ]   
                }
                .
                .
                .
                .
                .
                .
                .
                .
            ],
            "page_path": [],
            "recipient_id": "4107168056",
            "collector_id": "216279750",
            "date_created": "2018-09-05T15:28:38+00:00",
            "survey_id": "222222222",
            "collection_mode": "default",
            "edit_url": "https://www.somelink.com/somerefnumber",
            "metadata": {
                "contact": {
                    "email": {
                        "type": "string",
                        "value": "name@somememail.com"
                    }
                }
            }
        },
        {
            "total_time": 6978,
            "href": "https://somelink.com",
            "custom_variables": {},
            "ip_address": "XXX.XXX.XX.XX",
            "id": "4444444",
            "logic_path": {},
            "date_modified": "2018-08-15T19:16:43+00:00",
            "response_status": "completed",
            "custom_value": "",
            "analyze_url": "https://somelink.com/respondent_id=4444444",
            "pages": [
                    .
                    .
                    .

            ]
        }

    ],
    "page": 1,
    "links": {
        "self": "https://api.somelink.com/22222222/responses/bulk?page=1&per_page=50"
    }
}

There can be any number of responses, pages, and questions within a page.

My question is: How can I get the above json into a pandas datafame that looks like this: Formatted Pandas dataframe (as excel for reference)

I've tried using json_normalize, but I believe I am making some mistakes.

import pandas as pd
import requests
from pandas.io.json import json_normalize

headers={"Authorization": "Bearer %s" % MY_ACCESS_TOKEN,
                           "Content-Type": "application/json"}

url = "https://api.surveymonkey.com/v3/surveys/%s/responses/bulk" % (my_survey_id)
response = requests.get(url, headers=headers)
responses = response.json()

pages_data = json_normalize(data=responses['data'], record_path='pages', meta=['response_status', 'recipient_id', 'collector_id', 'survey_id', 'date_created', 'date_modified', 'ip_address', 'href', 'total_time'])
print(pages_data.head(10))
pages_data.to_csv("output.csv")

As addintional info, this is from the SurveyMonkey API. SurveyMonkey lets you export survey results to csv via the web interface, but I'd like to use the API to re-create the standard response report, and eventually make custom reports/do other things.

I'm using python3.6. I'm assuming I need to do some pre-processing to flatten out the data, but I'm not sure how to do so. Any help appreciated!

user2752159
  • 1,182
  • 3
  • 13
  • 29
  • if the json is always in the same format, could you take the values you need, put them in a list (or other format), then build a dataframe from there? – MattR Oct 29 '18 at 20:23
  • @MattR, the JSON will be different depending on the survey. I'd like to programmatically flatten the JSON regardless of survey structure. If I did know the survey structure, I could try to extract the values manually into a list. Do you have code snippets that could help with that? – user2752159 Oct 30 '18 at 16:17
  • you can think of JSON as dictionaries, so anything that will flatten dictionaries (like maybe [this post](https://stackoverflow.com/questions/33440624/pythonic-way-to-flatten-a-dictionary-into-a-list-using-list-comprehension) will help you. – MattR Oct 30 '18 at 16:31

2 Answers2

0

Here is a way to flatten the nested dictionary in pandas using glom. The aim is to extract selected keys and value from the nested dictionary and save them in a separate column of the pandas dataframe (:

Here is a step by step guide: https://medium.com/@enrico.alemani/flatten-nested-dictionaries-in-pandas-using-glom-7948345c88f5

import pandas as pd
from glom import glom
from ast import literal_eval


target = {
    "page": 10,
    "total": 100,
    "data":
        {
            "total_time": 100,
            "href": "https://mylink.com",
            "ip_address": "000.00.00.00",
            "id": "012345",
            "response_status": "completed",
        }
}

# Import data
df = pd.DataFrame([str(target)], columns=['target'])

# Extract id keys and save value into a separate pandas column
df['id'] = df['target'].apply(lambda row: glom(literal_eval(row), 'data.id'))
iEriii
  • 403
  • 2
  • 7
-2

Convert your json into dataframe:

df = pd.DataFrame(json)

after:

df = df.to_csv(file, sep='\t')