How to generate n-level hierarchical JSON from pandas DataFrame?

Question

Is there an efficient way to create hierarchical JSON (n levels deep) where the parent values, rather than the variable labels, are the keys? For example:

{"2017-12-31":
    {"Junior":
        {"Electronics":
            {"A":
                {"sales": 0.440755
                },
             "B":
                {"sales": -3.230951
                }
            }
        }, ...etc...
    }, ...etc...
}, ...etc...

1. My testing DataFrame:

import numpy as np
import pandas as pd

colIndex=pd.MultiIndex.from_product([['New York','Paris'],
                                     ['Electronics','Household'],
                                     ['A','B','C'],
                                     ['Junior','Senior']],
                               names=['City','Department','Team','Job Role'])

rowIndex=pd.date_range('25-12-2017',periods=12,freq='D')

df1=pd.DataFrame(np.random.randn(12, 24), index=rowIndex, columns=colIndex)
df1.index.name='Date'
df2=df1.resample('M').sum()
df3=df2.stack(level=0).groupby('Date').sum()

2. Transformation I'm making as it seems to be the most logical structure to build the JSON from:

df4=df3.stack(level=[0,1,2]).reset_index() \
    .set_index(['Date','Job Role','Department','Team']) \
    .sort_index()

3. My attempts-so-far

I came across this very helpful SO question which solves the problem for one level of nesting using code along the lines of:

j =(df.groupby(['ID','Location','Country','Latitude','Longitude'],as_index=False) \
    .apply(lambda x: x[['timestamp','tide']].to_dict('records'))\
    .reset_index()\
    .rename(columns={0:'Tide-Data'})\
    .to_json(orient='records'))

...but I can't find a way to get nested .groupby()s working:

j=(df.groupby('date', as_index=True).apply(
    lambda x: x.groupby('Job Role', as_index=True).apply(
        lambda x: x.groupby('Department', as_index=True).apply(
            lambda x: x.groupby('Team', as_index=True).to_dict())))  \
                .reset_index().rename(columns={0:'sales'}).to_json(orient='records'))

can you post a sample data set in text/CSV form, so we could copy and paste it? — MaxU - stand with Ukraine, Sep 13 '17 at 19:52
@MaxU - I've updated the start of the question with my input dummy DataFrame - thanks! — Bendy, Sep 13 '17 at 20:17

score 9 · Accepted Answer · answered Sep 18 '17 at 07:56

You can use itertuples to build a nested dict and then dump it to JSON. To do this, first convert the Date timestamps to strings, since they will become JSON object keys:

df4=df3.stack(level=[0,1,2]).reset_index()
df4['Date'] = df4['Date'].dt.strftime('%Y-%m-%d')
df4 = df4.set_index(['Date','Job Role','Department','Team']) \
    .sort_index()

Create the nested dict:

import collections

def nested_dict():
    # Every missing key yields another nested_dict, so any depth works
    return collections.defaultdict(nested_dict)

result = nested_dict()

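Use itertuples to populate it:

for row in df4.itertuples():
    # row.Index is the (Date, Job Role, Department, Team) tuple;
    # the value column is named 0, which itertuples exposes as _1
    result[row.Index[0]][row.Index[1]][row.Index[2]][row.Index[3]]['sales'] = row._1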

Then use the json module to dump it:

import json
json.dumps(result)

'{"2017-12-31": {"Junior": {"Electronics": {"A": {"sales": -0.3947134370101142}, "B": {"sales": -0.9873530754403204}, "C": {"sales": -1.1182598058984508}}, "Household": {"A": {"sales": -1.1211850078098677}, "B": {"sales": 2.0330914483907847}, "C": {"sales": 3.94762379718749}}}, "Senior": {"Electronics": {"A": {"sales": 1.4528493451404196}, "B": {"sales": -2.3277322345261005}, "C": {"sales": -2.8040263791743922}}, "Household": {"A": {"sales": 3.0972591929279663}, "B": {"sales": 9.884565742502392}, "C": {"sales": 2.9359830722457576}}}}, "2018-01-31": {"Junior": {"Electronics": {"A": {"sales": -1.3580300149125217}, "B": {"sales": 1.414665000013205}, "C": {"sales": -1.432795129108244}}, "Household": {"A": {"sales": 2.7783259569115346}, "B": {"sales": 2.717700275321333}, "C": {"sales": 1.4358377416259644}}}, "Senior": {"Electronics": {"A": {"sales": 2.8981726774941485}, "B": {"sales": 12.022897003654117}, "C": {"sales": 0.01776855733076088}}, "Household": {"A": {"sales": -3.342163776613092}, "B": {"sales": -5.283208386572307}, "C": {"sales": 2.942580121975619}}}}}'
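The loop above hard-codes four index levels. For an arbitrary number of levels, the same idea might be sketched as below. This is only an illustration, not part of the original answer: the helper name `nest`, its `leaf_key` parameter, and the sample rows are hypothetical. It assumes the input is an iterable of (key tuple, value) pairs, such as those produced by calling `.items()` on a Series with a MultiIndex.

```python
import json

def nest(pairs, leaf_key="sales"):
    """Build a nested dict from (key_tuple, value) pairs of any depth."""
    root = {}
    for keys, value in pairs:
        node = root
        # Walk down one level per key, creating dicts as needed
        for key in keys:
            node = node.setdefault(key, {})
        node[leaf_key] = value
    return root

# Hypothetical sample rows standing in for (index tuple, value) pairs
rows = [
    (("2017-12-31", "Junior", "Electronics", "A"), 0.44),
    (("2017-12-31", "Junior", "Electronics", "B"), -3.23),
    (("2017-12-31", "Senior", "Household", "A"), 1.5),
]

nested = nest(rows)
print(json.dumps(nested, indent=2))
```

With the DataFrame from this answer, something like `nest(df4[0].items())` should give the same structure, assuming the value column is still named `0`.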

I got the nested_dict method from [this SO-post](https://stackoverflow.com/a/36299615/1562285). Apparently it can be even [shorter](https://stackoverflow.com/a/8702435/1562285) `nested_dict = lambda: defaultdict(nested_dict)` — Maarten Fabré, Sep 18 '17 at 09:03
Could you please help me with this question? I am confused as to how to achieve that. https://stackoverflow.com/questions/53477724/pyspark-how-to-create-a-nested-json-from-spark-data-frame — Shankar Panda, Nov 27 '18 at 05:16

score 4 · Answer 2 · answered Sep 05 '18 at 15:44

I ran into this and was confused by the complexity of the OP's setup. Here is a minimal example and solution (based on the answer provided by @Maarten Fabré).

import collections
import pandas as pd

# build init DF
x = ['a', 'a']
y = ['b', 'c']
z = [['d'], ['e', 'f']]
df = pd.DataFrame(list(zip(x, y, z)), columns=['x', 'y', 'z'])

#    x  y       z
# 0  a  b     [d]
# 1  a  c  [e, f]

Set up the regular, flat index, and then make that a multi-index:

# set flat index
df = df.set_index(['x', 'y'])

# set up multi index
df = df.reindex(pd.MultiIndex.from_tuples(list(zip(x, y))))

#           z
# a b     [d]
#   c  [e, f]

Then init a nested dictionary, and fill it out item-by-item

nested_dict = collections.defaultdict(dict)

for keys, value in df.z.items():
    nested_dict[keys[0]][keys[1]] = value

# defaultdict(dict, {'a': {'b': ['d'], 'c': ['e', 'f']}})

At this point you can JSON dump it, etc.
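As a minimal sketch of that last step: `json.dumps` accepts a `defaultdict` directly, since it is a `dict` subclass. The dictionary is rebuilt by hand here so the snippet runs without pandas, and the variable name `json_text` is just illustrative.

```python
import collections
import json

# Rebuild the same nested structure as above, without pandas
nested_dict = collections.defaultdict(dict)
nested_dict['a']['b'] = ['d']
nested_dict['a']['c'] = ['e', 'f']

# defaultdict is a dict subclass, so json can serialize it as-is
json_text = json.dumps(nested_dict)
print(json_text)  # {"a": {"b": ["d"], "c": ["e", "f"]}}
```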
