Partially flatten nested JSON and pivot longer

Question

I have many JSON files with the following structure:

{
  "requestId": "test",
  "executionDate": "2023-05-10",
  "executionTime": "12:02:22",
  "request": {
    "fields": [{
      "geometry": {
        "type": "Point",
        "coordinates": [-90, 41]
      },
      "colour": "blue",
      "bean": "blaCk",
      "birthday": "2021-01-01",
      "arst": "111",
      "arstg": "rst",
      "fct": {
        "start": "2011-01-10",
        "end": "2012-01-10"
      }
    }]
  },
  "response": {
    "results": [{
        "geom": {
          "type": "geo",
          "coord": [-90, 41]
        },
        "md": {
          "type": "arstat",
          "mdl": "trstr",
          "vs": "v0",
          "cal": {
            "num": 4,
            "comment": "message"
          },
          "bean": ["blue", "green"],
          "result_time": 12342
        },
        "predictions": [{
            "date": "2004-05-19",
            "day": 0,
            "count": 0,
            "eating_stage": "trt"
          }, {
            "date": "2002-01-20",
            "day": 1,
            "count": 0,
            "eating_stage": "arstg"
          }, {
            "date": "2004-05-21",
            "day": 2,
            "count": 0,
            "eating_stage": "strg"
          }, {
            "date": "2004-05-22",
            "day": 3,
            "count": 0,
            "eating_stage": "rst"
          }
        }
      }
    }

The predictions part can be very deep. I want to convert this JSON to a CSV with the following structure:

requestId	executionDate	executionTime	colour	predictions_date	predictions_day	predictions_eating_stage
test	2023-05-10	12:02:22	blue	2004-05-19	0	trt
test	2023-05-10	12:02:22	blue	2002-01-20	1	astrg
test	2023-05-10	12:02:22	blue	2004-05-21	2	strg
test	2023-05-10	12:02:22	blue	2004-05-22	3	rst

I tried the following code:

flat_json = pd.DataFrame(
    flatten(json_data), index=[0]
)

The code results in every data point becoming a column, and I am not sure how to pivot longer where at the 'predictions' key using JSON functions in Python. I recognise that at this stage I could pivot longer using column names, but I feel like there is a cleaner way to achieve this.

score 1 · Accepted Answer · answered Jun 01 '23 at 14:35

1

I would suggest simply extracting what you need. It seems very specific for it to be solved using specific parsing. Therefore I would start by creating two dataframes:

df_prediction = pd.DataFrame(example['response']['results'][0]['predictions'])
df_data = pd.DataFrame({x:y for x,y in example.items() if type(y)==str},index=[0])

Renaming columns in predictions:

df_prediction.columns = ['prediction_'+x for x in df_prediction]

Joining and adding the last piece of data (colour):

output = df_data.assign(colour = example['request']['fields'][0]['colour']).join(df_prediction,how='right').ffill()

Outputting:

  requestId executionDate  ... prediction_count prediction_eating_stage
0      test    2023-05-10  ...                0                     trt
1      test    2023-05-10  ...                0                   arstg
2      test    2023-05-10  ...                0                    strg
3      test    2023-05-10  ...                0                     rst

answered Jun 01 '23 at 14:35

Celius Stingher

17,835
6
23
53

1

Perfect, thank you for the structured answer and clear explanation. Worked as intended. I like that you created two dataframes, as that is what I was thinking long term to make use of the relational component. – prayner Jun 01 '23 at 15:29
Happy to help. Yeah, I couldn't find a way to "smartly" capture that data you want, so I just thought it would be best too just extract it from the json/file. – Celius Stingher Jun 01 '23 at 15:31
I just didn't really understand how do go about it as I am inexperienced with restructuring JSONs and using python. This approach makes sense to me. – prayner Jun 01 '23 at 15:40
You can check this other answer of mine with a little bit more explanation around JSONs in python. – Celius Stingher Jun 01 '23 at 18:03
1

Sorry, here's the link: https://stackoverflow.com/questions/57875259/how-to-extract-nested-json-data/57875343#57875343 – Celius Stingher Jun 01 '23 at 18:12

score 1 · Answer 2 · answered Jun 01 '23 at 19:49

You can also use json_normalize to extract the array of records that you want to normalize into a csv.

>>> df_predictions = pd.json_normalize(json_data,record_path=['response', 'results','predictions'], record_prefix='predictions.', meta=['requestId', 'executionDate', 'executionTime']).assign(colour = json_data['request']['fields'][0]['colour'])
>>> df_predictions
  predictions.date  predictions.day  predictions.count  ... executionDate executionTime colour
0       2004-05-19                0                  0  ...    2023-05-10      12:02:22   blue
1       2002-01-20                1                  0  ...    2023-05-10      12:02:22   blue
2       2004-05-21                2                  0  ...    2023-05-10      12:02:22   blue
3       2004-05-22                3                  0  ...    2023-05-10      12:02:22   blue

[4 rows x 8 columns]

It is just unfortunate that there is a limitation on the meta fields as it is throwing an exception for a path that includes an array / list so the "colour" column was added separately. If the order is important, then you can rearrange the columns as needed.

Partially flatten nested JSON and pivot longer

2 Answers2