I have a large nested JSON object that I'd like to read into a pandas dataframe. The object has several layers of nested key/value pairs. The basic layering is:
- API Call metadata (I don't really care about this)
- Survey response metadata (I'd like this information to be included in the final output)
- Pages (survey pages with an id number, and what questions are on that page)
- Questions (the question_ids, and related response_ids from the questions)
A sample looks like this:
{
"per_page": 50,
"total": 4,
"data": [
{
"total_time": 5276,
"href": "https://somelink.com",
"custom_variables": {},
"ip_address": "XXX.XXX.XX.XX",
"id": "1111111",
"logic_path": {},
"date_modified": "2018-08-17T19:57:43+00:00",
"response_status": "completed",
"custom_value": "",
"analyze_url": "https://somelink.com/respondent_id=1111111",
"pages": [
{
"id": "38638937",
"questions": []
},
{
"id": "38638938",
"questions": [
{
"id": "124810659",
"answers": [
{
"text": "some_answer_text"
}
]
},
{
"id": "124810660",
"answers": [
{
"text": "some_other_answer_text"
}
]
}
]
},
{
"id": "38638944",
"questions": [
{
"id": "124810656",
"answers": [
{
"col_id": "905794209",
"choice_id": "905794459",
"row_id": "905794204"
},
{
"col_id": "905794210",
"choice_id": "905794463",
"row_id": "905794204"
},
{
"col_id": "905794209",
"choice_id": "905794459",
"row_id": "905794205"
},
{
"col_id": "905794210",
"choice_id": "905794464",
"row_id": "905794205"
}
]
}
]
}
...
],
"page_path": [],
"recipient_id": "4107168056",
"collector_id": "216279750",
"date_created": "2018-09-05T15:28:38+00:00",
"survey_id": "222222222",
"collection_mode": "default",
"edit_url": "https://www.somelink.com/somerefnumber",
"metadata": {
"contact": {
"email": {
"type": "string",
"value": "name@somememail.com"
}
}
}
},
{
"total_time": 6978,
"href": "https://somelink.com",
"custom_variables": {},
"ip_address": "XXX.XXX.XX.XX",
"id": "4444444",
"logic_path": {},
"date_modified": "2018-08-15T19:16:43+00:00",
"response_status": "completed",
"custom_value": "",
"analyze_url": "https://somelink.com/respondent_id=4444444",
"pages": [
...
]
}
],
"page": 1,
"links": {
"self": "https://api.somelink.com/22222222/responses/bulk?page=1&per_page=50"
}
}
There can be any number of responses, pages, and questions within a page.
My question is:
How can I get the above JSON into a pandas dataframe, with one row per answer and the response metadata carried alongside the page and question ids?
I've tried using json_normalize, but I believe I am making some mistakes.
import pandas as pd
import requests
from pandas.io.json import json_normalize

headers = {
    "Authorization": "Bearer %s" % MY_ACCESS_TOKEN,
    "Content-Type": "application/json",
}
url = "https://api.surveymonkey.com/v3/surveys/%s/responses/bulk" % my_survey_id

response = requests.get(url, headers=headers)
responses = response.json()

# Flatten one level: one row per page, with the response-level
# metadata repeated on each row.
pages_data = json_normalize(
    data=responses['data'],
    record_path='pages',
    meta=['response_status', 'recipient_id', 'collector_id', 'survey_id',
          'date_created', 'date_modified', 'ip_address', 'href', 'total_time'],
)

print(pages_data.head(10))
pages_data.to_csv("output.csv")
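One variant I've been experimenting with: json_normalize also accepts a nested record_path, so it can walk all the way down to the answers in one call, pulling the ids from each level along as meta columns. A minimal sketch against a trimmed copy of the payload above (I'm calling pd.json_normalize here, which is where the function lives in newer pandas; on my pandas it's imported from pandas.io.json as in the snippet above):

```python
import pandas as pd

# Trimmed stand-in for the API payload above (one response, one answered question).
responses = {
    "data": [
        {
            "id": "1111111",
            "response_status": "completed",
            "pages": [
                {"id": "38638937", "questions": []},  # empty pages produce no rows
                {
                    "id": "38638938",
                    "questions": [
                        {"id": "124810659",
                         "answers": [{"text": "some_answer_text"}]},
                    ],
                },
            ],
        }
    ]
}

# Descend pages -> questions -> answers; carry the response id, its status,
# and the page/question ids down to every answer row.
df = pd.json_normalize(
    responses["data"],
    record_path=["pages", "questions", "answers"],
    meta=["id", "response_status", ["pages", "id"], ["pages", "questions", "id"]],
)
print(sorted(df.columns))
# ['id', 'pages.id', 'pages.questions.id', 'response_status', 'text']
```

The nested meta paths like ["pages", "id"] become dotted column names ("pages.id"), which keeps the three different "id" fields from colliding.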
As additional info, this is from the SurveyMonkey API. SurveyMonkey lets you export survey results to CSV via the web interface, but I'd like to use the API to re-create the standard response report, and eventually build custom reports and do other things.
I'm using Python 3.6. I'm assuming I need to do some pre-processing to flatten out the data, but I'm not sure how to do so. Any help appreciated!
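For reference, the pre-processing I had in mind would be something like this sketch: walk the nested lists by hand and emit one flat dict per answer, then hand the list to the DataFrame constructor. Column names like respondent_id/page_id/question_id are my own choices, not fields from the API:

```python
import pandas as pd

def flatten_responses(payload):
    """One flat dict per answer, with response/page/question ids attached."""
    rows = []
    for resp in payload["data"]:
        # Response-level metadata, repeated on every row for this respondent.
        meta = {
            "respondent_id": resp["id"],
            "response_status": resp["response_status"],
            "total_time": resp["total_time"],
        }
        for page in resp["pages"]:
            for question in page["questions"]:
                for answer in question["answers"]:
                    rows.append({
                        **meta,
                        "page_id": page["id"],
                        "question_id": question["id"],
                        **answer,  # either text, or col_id/choice_id/row_id
                    })
    return pd.DataFrame(rows)

# Trimmed stand-in for the payload above (one matrix-style answer).
payload = {
    "data": [
        {
            "id": "1111111",
            "response_status": "completed",
            "total_time": 5276,
            "pages": [
                {
                    "id": "38638944",
                    "questions": [
                        {
                            "id": "124810656",
                            "answers": [
                                {"col_id": "905794209",
                                 "choice_id": "905794459",
                                 "row_id": "905794204"},
                            ],
                        },
                    ],
                },
            ],
        }
    ]
}

df = flatten_responses(payload)
```

Because text answers and matrix answers carry different keys, the resulting frame would have NaN in the columns a given answer type doesn't use.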