154

Suppose I have a nested dictionary 'user_dict' with structure:

  • Level 1: UserId (Long Integer)
  • Level 2: Category (String)
  • Level 3: Assorted Attributes (floats, ints, etc..)

For example, an entry of this dictionary would be:

user_dict[12] = {
    "Category 1": {"att_1": 1, 
                   "att_2": "whatever"},
    "Category 2": {"att_1": 23, 
                   "att_2": "another"}}

each item in user_dict has the same structure and user_dict contains a large number of items which I want to feed to a pandas DataFrame, constructing the series from the attributes. In this case a hierarchical index would be useful for the purpose.

Specifically, my question is whether there exists a way to to help the DataFrame constructor understand that the series should be built from the values of the "level 3" in the dictionary?

If I try something like:

df = pandas.DataFrame(users_summary)

The items in "level 1" (the UserId's) are taken as columns, which is the opposite of what I want to achieve (have UserId's as index).

I know I could construct the series after iterating over the dictionary entries, but if there is a more direct way this would be very useful. A similar question would be asking whether it is possible to construct a pandas DataFrame from json objects listed in a file.

smci
  • 32,567
  • 20
  • 113
  • 146
vladimir montealegre
  • 2,010
  • 2
  • 15
  • 17
  • See [this answer](https://stackoverflow.com/a/54300940/4909087) for simpler alternatives. – cs95 Jan 22 '19 at 04:34

7 Answers7

208

A pandas MultiIndex consists of a list of tuples. So the most natural approach would be to reshape your input dict so that its keys are tuples corresponding to the multi-index values you require. Then you can just construct your dataframe using pd.DataFrame.from_dict, using the option orient='index':

user_dict = {12: {'Category 1': {'att_1': 1, 'att_2': 'whatever'},
                  'Category 2': {'att_1': 23, 'att_2': 'another'}},
             15: {'Category 1': {'att_1': 10, 'att_2': 'foo'},
                  'Category 2': {'att_1': 30, 'att_2': 'bar'}}}

pd.DataFrame.from_dict({(i,j): user_dict[i][j] 
                           for i in user_dict.keys() 
                           for j in user_dict[i].keys()},
                       orient='index')


               att_1     att_2
12 Category 1      1  whatever
   Category 2     23   another
15 Category 1     10       foo
   Category 2     30       bar

An alternative approach would be to build your dataframe up by concatenating the component dataframes:

user_ids = []
frames = []

for user_id, d in user_dict.iteritems():
    user_ids.append(user_id)
    frames.append(pd.DataFrame.from_dict(d, orient='index'))

pd.concat(frames, keys=user_ids)

               att_1     att_2
12 Category 1      1  whatever
   Category 2     23   another
15 Category 1     10       foo
   Category 2     30       bar
maxymoo
  • 35,286
  • 11
  • 92
  • 119
Wouter Overmeire
  • 65,766
  • 10
  • 63
  • 43
  • 18
    Is there a reasonable way to generalise this to work with arbitrary depth ragged lists? e.g. lists to an arbitrary depth, where some branches may be shorter than others, and a None or nan is used when shorter branches don't reach the end? – naught101 Nov 11 '13 at 05:59
  • 7
    Have you looked at pandas json support (io tools) and normalization? http://pandas.pydata.org/pandas-docs/dev/io.html#normalization – Wouter Overmeire Nov 12 '13 at 10:40
  • 1
    for me, the first method created a dataframe with a single index with tuples. the second method worked as desired/expected! – arturomp Apr 30 '18 at 20:57
  • Any tips on how to name these new columns? For example, if I want these numbers 12 and 15 to be in the column 'id'. – cheremushkin Mar 02 '19 at 16:41
  • 1
    @cheremushkin 12 and 15 are now in the row 'id', if you tranpose (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transpose.html) they are in the column 'id'. You can also unstack (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.unstack.html) It all depends on what you really need. – Wouter Overmeire Mar 03 '19 at 11:07
  • @WouterOvermeire but if I want to sort by this column name I get "['id'] not in index" – cheremushkin Mar 05 '19 at 21:04
  • @cheremushkin Could you open a new question on SO? – Wouter Overmeire Mar 06 '19 at 18:47
  • 4
    in python 3 a dict no longer has iteritems method, in the 2nd approach this line `for user_id, d in user_dict.iteritems():` should be changed `for user_id, d in user_dict.items():` – Madcat Aug 29 '20 at 13:58
  • Wow... The second method is so intuitive and easy that i managed to do EXACTLY what i wanted even if i didn't khow how to use pandas since it's my first time with it. Awesome explanation @WouterOvermeire... Thanks for taking the time – LukeSavefrogs Feb 03 '21 at 03:42
63

pd.concat accepts a dictionary. With this in mind, it is possible to improve upon the currently accepted answer in terms of simplicity and performance by use a dictionary comprehension to build a dictionary mapping keys to sub-frames.

pd.concat({k: pd.DataFrame(v).T for k, v in user_dict.items()}, axis=0)

Or,

pd.concat({
        k: pd.DataFrame.from_dict(v, 'index') for k, v in user_dict.items()
    }, 
    axis=0)

              att_1     att_2
12 Category 1     1  whatever
   Category 2    23   another
15 Category 1    10       foo
   Category 2    30       bar
cs95
  • 379,657
  • 97
  • 704
  • 746
  • 7
    Brilliant! Much better :) – pg2455 Mar 28 '19 at 16:18
  • 5
    How would you do it if you still had an further inner category? Such as `12:{cat1:{cat11:{att1:val1,att2:val2}}}`. In other words: how would some one generalize the solution to an irrelevant number of categories ? – Lucas Aimaretto Sep 17 '19 at 20:40
  • 3
    @LucasAimaretto Usually arbitrarily nested structures can be flattened with `json_normalize`. I have [another answer](https://stackoverflow.com/a/53831756/4909087) which shows how it works. – cs95 Sep 17 '19 at 21:10
  • 1
    Doesn't work if `v` is an single integer for example. Do you know an alternative in such case? – swiss_knight Apr 09 '20 at 20:38
  • Why not to treat as a nested json? `pd.json_normalize` should do the work for you – Tito Sanz Jan 19 '21 at 10:20
  • @TitoSanz Well, did you try it? How did that work out? – cs95 Jan 30 '21 at 05:16
  • I posted an answer that should work for arbitrary depth – tRosenflanz Mar 24 '21 at 21:25
  • @cs95 If I have 3 nested dictionary like this `dict = {L1: {L2: {L3: {L4: [array([` then how I have to write? I am writting `pd.json_normalize(dict)` but getting error `KeyError: '0.1'` (as my L1 is 0.1) – 0Knowledge May 25 '21 at 01:04
27

This solution should work for arbitrary depth by flattening dictionary keys to a tuple chain

def flatten_dict(nested_dict):
    res = {}
    if isinstance(nested_dict, dict):
        for k in nested_dict:
            flattened_dict = flatten_dict(nested_dict[k])
            for key, val in flattened_dict.items():
                key = list(key)
                key.insert(0, k)
                res[tuple(key)] = val
    else:
        res[()] = nested_dict
    return res


def nested_dict_to_df(values_dict):
    flat_dict = flatten_dict(values_dict)
    df = pd.DataFrame.from_dict(flat_dict, orient="index")
    df.index = pd.MultiIndex.from_tuples(df.index)
    df = df.unstack(level=-1)
    df.columns = df.columns.map("{0[1]}".format)
    return df
tRosenflanz
  • 724
  • 5
  • 8
16

In case someone wants to get the data frame in a "long format" (leaf values have the same type) without multiindex, you can do this:

pd.DataFrame.from_records(
    [
        (level1, level2, level3, leaf)
        for level1, level2_dict in user_dict.items()
        for level2, level3_dict in level2_dict.items()
        for level3, leaf in level3_dict.items()
    ],
    columns=['UserId', 'Category', 'Attribute', 'value']
)

    UserId    Category Attribute     value
0       12  Category 1     att_1         1
1       12  Category 1     att_2  whatever
2       12  Category 2     att_1        23
3       12  Category 2     att_2   another
4       15  Category 1     att_1        10
5       15  Category 1     att_2       foo
6       15  Category 2     att_1        30
7       15  Category 2     att_2       bar

(I know the original question probably wants (I.) to have Levels 1 and 2 as multiindex and Level 3 as columns and (II.) asks about other ways than iteration over values in the dict. But I hope this answer is still relevant and useful (I.): to people like me who have tried to find a way to get the nested dict into this shape and google only returns this question and (II.): because other answers involve some iteration as well and I find this approach flexible and easy to read; not sure about performance, though.)

Melkor.cz
  • 1,977
  • 17
  • 15
  • This is great for visualizations and I'm sure if someone had the time they could make it into a function that just takes the dictionary and column list as arguments and infers the nesting depth from the columns. – grofte Apr 17 '23 at 13:28
12

So I used to use a for loop for iterating through the dictionary as well, but one thing I've found that works much faster is to convert to a panel and then to a dataframe. Say you have a dictionary d

import pandas as pd
d
{'RAY Index': {datetime.date(2014, 11, 3): {'PX_LAST': 1199.46,
'PX_OPEN': 1200.14},
datetime.date(2014, 11, 4): {'PX_LAST': 1195.323, 'PX_OPEN': 1197.69},
datetime.date(2014, 11, 5): {'PX_LAST': 1200.936, 'PX_OPEN': 1195.32},
datetime.date(2014, 11, 6): {'PX_LAST': 1206.061, 'PX_OPEN': 1200.62}},
'SPX Index': {datetime.date(2014, 11, 3): {'PX_LAST': 2017.81,
'PX_OPEN': 2018.21},
datetime.date(2014, 11, 4): {'PX_LAST': 2012.1, 'PX_OPEN': 2015.81},
datetime.date(2014, 11, 5): {'PX_LAST': 2023.57, 'PX_OPEN': 2015.29},
datetime.date(2014, 11, 6): {'PX_LAST': 2031.21, 'PX_OPEN': 2023.33}}}

The command

pd.Panel(d)
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 2 (major_axis) x 4 (minor_axis)
Items axis: RAY Index to SPX Index
Major_axis axis: PX_LAST to PX_OPEN
Minor_axis axis: 2014-11-03 to 2014-11-06

where pd.Panel(d)[item] yields a dataframe

pd.Panel(d)['SPX Index']
2014-11-03  2014-11-04  2014-11-05 2014-11-06
PX_LAST 2017.81 2012.10 2023.57 2031.21
PX_OPEN 2018.21 2015.81 2015.29 2023.33

You can then hit the command to_frame() to turn it into a dataframe. I use reset_index as well to turn the major and minor axis into columns rather than have them as indices.

pd.Panel(d).to_frame().reset_index()
major   minor      RAY Index    SPX Index
PX_LAST 2014-11-03  1199.460    2017.81
PX_LAST 2014-11-04  1195.323    2012.10
PX_LAST 2014-11-05  1200.936    2023.57
PX_LAST 2014-11-06  1206.061    2031.21
PX_OPEN 2014-11-03  1200.140    2018.21
PX_OPEN 2014-11-04  1197.690    2015.81
PX_OPEN 2014-11-05  1195.320    2015.29
PX_OPEN 2014-11-06  1200.620    2023.33

Finally, if you don't like the way the frame looks you can use the transpose function of panel to change the appearance before calling to_frame() see documentation here http://pandas.pydata.org/pandas-docs/dev/generated/pandas.Panel.transpose.html

Just as an example

pd.Panel(d).transpose(2,0,1).to_frame().reset_index()
major        minor  2014-11-03  2014-11-04  2014-11-05  2014-11-06
RAY Index   PX_LAST 1199.46    1195.323     1200.936    1206.061
RAY Index   PX_OPEN 1200.14    1197.690     1195.320    1200.620
SPX Index   PX_LAST 2017.81    2012.100     2023.570    2031.210
SPX Index   PX_OPEN 2018.21    2015.810     2015.290    2023.330

Hope this helps.

Mishiko
  • 341
  • 2
  • 5
  • 11
  • 12
    Panel is deprecated in more recent versions of pandas (v0.23 at the time of writing). – cs95 Jan 22 '19 at 03:42
5

Building on verified answer, for me this worked best:

ab = pd.concat({k: pd.DataFrame(v).T for k, v in data.items()}, axis=0)
ab.T
El_1988
  • 339
  • 3
  • 13
5

For other ways to represent the data, you don't need to do much. For example, if you just want the "outer" key to be an index, the "inner" key to be columns and the values to be cell values, this would do the trick:

df = pd.DataFrame.from_dict(user_dict, orient='index')


Pss
  • 553
  • 4
  • 12
  • I get error: `Anaconda3\lib\site-packages\pandas\core\internals\construction.py:309: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray values = np.array([convert(v) for v in values])` – PM0087 Apr 05 '21 at 18:29
  • Does not work for nested dict where the values stay in dictionary form. Still good if you want the right output shape (transposed) to start with. – questionto42 Dec 20 '21 at 15:50
  • It works for `dict()` nested up to two levels. I have not tested any more levels. – odunayo12 Feb 15 '22 at 10:32