0

I am trying to convert a heavily nested JSON file into a data format that can easily be loaded into a database table. I've been using json_normalize from the pandas library to flatten various parts of the file into a tabular format. This has mostly worked, however i've come across a section with the following structure:

{"data": {
    "Revenue": {
        "2019-05-31": 50,
        "2018-05-31": 60,
        "2017-05-31": 70,
        "2016-05-31": 75,
        "2015-05-31": 90,
        "2014-05-31": 100
    },
    "Cost of Revenue": {
        "2019-05-31": 45,
        "2018-05-31": 40,
        "2017-05-31": 57,
        "2016-05-31": 58,
        "2015-05-31": 68,
        "2014-05-31": 70
    }
    }}

I would like to convert this data into the following format (assuming this is the best way to structure this type of data):

measure                 period end     value
Revenue                 2019-05-31     50
Revenue                 2018-05-31     60
Revenue                 2017-05-31     70
Revenue                 2016-05-31     75
Revenue                 2015-05-31     90
Revenue                 2014-05-31     100
Cost of Revenue         2019-05-31     45
Cost of Revenue         2018-05-31     40
Cost of Revenue         2017-05-31     57
Cost of Revenue         2016-05-31     58
Cost of Revenue         2015-05-31     68
Cost of Revenue         2014-05-31     70

I won't know what the 'measure' and 'period end' key names are ahead of time, but it will always have the above structure. I haven't been able to find a generic solution, but i'm fairly new to Python.

Any help or guidance would be much appreciated.

LostAtSea
  • 3
  • 1
  • might help https://stackoverflow.com/questions/21104592/json-to-pandas-dataframe – Yeganeh Salami Dec 06 '19 at 12:06
  • many thanks, i have tried 'json_normalize', which will give me the combined measure and period end as column headers (i.e. Cost of Revenue_2014-05-31). I was thinking of transposing this and then splitting on the '_' to get it back into columns, however it seems a bit of a messy way of doing it. – LostAtSea Dec 06 '19 at 12:23

1 Answers1

2

One approch is load the 'data' part into DataFrame and stack it.

df = pd.DataFrame.from_dict(data['data'], orient='index').stack().reset_index()
df.columns = ['measure', 'period end', 'value']
df

It will give you the output as follows

    measure     period end  value
0   Revenue     2019-05-31  50
1   Revenue     2018-05-31  60
2   Revenue     2017-05-31  70
3   Revenue     2016-05-31  75
4   Revenue     2015-05-31  90
5   Revenue     2014-05-31  100
6   Cost of Revenue     2019-05-31  45
7   Cost of Revenue     2018-05-31  40
8   Cost of Revenue     2017-05-31  57
9   Cost of Revenue     2016-05-31  58
10  Cost of Revenue     2015-05-31  68
11  Cost of Revenue     2014-05-31  70
Prince Francis
  • 2,995
  • 1
  • 14
  • 22
  • Excellent, many thanks, after changing `pd.DataFrame.from_dict(data['data'], orient='index').stack().reset_index()` to `df = pd.DataFrame.from_dict(data['data'], orient='index').stack().reset_index()` it worked perfectly, this is exactly what i was looking for. I'm not sure how its working, but at least i know what i need to learn. Thanks again. – LostAtSea Dec 06 '19 at 12:42