0

Original question: I am using python 3. I have some 4 level dictionary and 5 level dictionary. I want to convert this multilevel dictionary into a pandas DataFrame with a recursive function

To simplified my question and test my function, I generated a 3 level dictionary as shown below and to try my recursive function. I understand that with this 3 levels nested dictionary, there are many other ways to solve the problem. But, I feel only recursive function can be easily applied to solve the problem on 4 levels, 5 levels or more levels dictionary

To create a simplified 3-level dictionary:


from collections import defaultdict
def ddict():
    return defaultdict(ddict)

tree = ddict()
tree['level1_1']['level2_1']['level3_1'] = <pd.Series1>
tree['level1_1']['level2_1']['level3_2'] = <pd.Series2>
tree['level1_1']['level2_2']['level3_1'] = <pd.Series3>
tree['level1_1']['level2_2']['level3_2'] = <pd.Series4>
tree['level1_2']['level2_1']['level3_1'] = <pd.Series5>
tree['level1_2']['level2_1']['level3_2'] = <pd.Series6>
tree['level1_2']['level2_2']['level3_1'] = <pd.Series7>
tree['level1_2']['level2_2']['level3_2'] = <pd.Series8>

Inspired by Bart Cubrich below, I revised xx's code and put my solution here

import collections
def tree2df (d, colname):
    """
    Inputs:
        1. d               (a nested dict, or a tree, all values are pd.Series)
        2. colname         (a list)

    Return:
        1. a pd.DataFrame  
    """
    def flatten(d, parent_key='', sep='-'):
        items = []
        for k, v in d.items():
            new_key = str(parent_key) + str(sep) + str(k) if parent_key else k
            if isinstance(v, collections.MutableMapping):
                items.extend(flatten(v, new_key, sep=sep).items())
            else:
                items.append((new_key, v))
        return dict(items)
    flat_dict = flatten (d)  
    levels, vals = zip(*[(tuple(level.split('-')),val) for level, val in flat_dict.items()])
    max_level = np.max(np.array([len(l) for l in levels]))
    if len(colname) != max_level:
        print ("The numbers of column name is invalid because of moer than maximum level: %s.\nNothing will be returned. Please revise the colname!"%max_level)
    else:
        colname += ['Old index']
        s = pd.concat(list(vals), keys = list(levels), names = colname)
        s = pd.DataFrame(s)
        s.reset_index(inplace=True)
        s.rename(columns={0:'Value'},inplace=True)
        return s

#Example
BlockEvent_TS_df = tree2df (BlockEvent_TS_tree, ['ID','Session','Trial type','Block', 'Event name'])

The 5-level nested dictionary is in the same idea as 3-level one:

tree['level1_1']['level2_1']['level3_1']['level4_1']['level5_1'] = <pd.Series1>
                ...
tree['level1_2']['level2_2']['level3_2']['level4_2']['level5_2'] = <pd.Series32>

Because I have a large dataset, so it's very complicated to show the whole nested dictionary here. But, the idea is like this. And later on, I want to have 6 col, 5 col to store each level and one column is for value.

I've tried the code above and it works well for me. The speed is also very decent.

Thanks for all your help!

Eric
  • 327
  • 1
  • 3
  • 9
  • Can you provide an example of there actual 5 level dictionary to make it more clear please? – bart cubrich Aug 01 '19 at 19:33
  • The data set is too big to actually type here. But, it should be like the same way as 3 level dictionary to be created. Just add more levels, and for each level, I want to convert to a column in dataframe. I can edit my post to illustrate more. – Eric Aug 01 '19 at 19:34
  • Is there a specific reason you are using `defaultdict`? – Akaisteph7 Aug 01 '19 at 19:38
  • Also, what is your question? What is the problem with your code? – Akaisteph7 Aug 01 '19 at 19:39
  • Possible duplicate of [Flatten nested dictionaries, compressing keys](https://stackoverflow.com/questions/6027558/flatten-nested-dictionaries-compressing-keys) – Green Cloak Guy Aug 01 '19 at 19:40
  • I saw Flatten nested dictionaries, compressing keys this one. But, it's not the way I want to. – Eric Aug 01 '19 at 19:42
  • I use defaultdict so that I can create a nested dictionary easily. As I said, it's just a very simple way to create a trial model. But, now the problem is I have a data set which is in nested dictionary way and I am curious how to convert to a dataframe. – Eric Aug 01 '19 at 19:44

3 Answers3

3

You need:

format_ = {(level1_key, level2_key, level3_key): values
 for level1_key, level2_dict in tree.items()
 for level2_key, level3_dict in level2_dict.items()
 for level3_key, values      in level3_dict.items()}
df = pd.DataFrame(format_, index=['Value']).T.reset_index()

Output:

     level_0    level_1      level_2    Value
0   level1_1    level2_1    level3_1    1
1   level1_1    level2_1    level3_2    2
2   level1_1    level2_2    level3_1    3
3   level1_1    level2_2    level3_2    4
4   level1_2    level2_1    level3_1    5
5   level1_2    level2_1    level3_2    6
6   level1_2    level2_2    level3_1    7
7   level1_2    level2_2    level3_2    8
harvpan
  • 8,571
  • 2
  • 18
  • 36
  • I agree this is a way to do it. But, if facing a 4-level dictionary, I have two write a few more lines. Problem is that if I have many many nested dictionary dataset and their level are not the same. So, I cannot write thsi code for each one of the multilevel dictionary. That's why I want to use recursion function to have it unfold by itself and can be concatenated into a dataframe. – Eric Aug 01 '19 at 19:48
0

So my solution would be to traverse the tree looking at all the keys and building out each elements path as an array then create the DataFrame from records. I split each of these steps into their own method.

There may be a more efficient approach, but this should get the job done. Hope this helps.

def traverse_tree(d, prefix='', results=[]):
    if type(d) is int:
        record = str(prefix).split(',')
        record.append(d)
        results.append(record)
        return results
    keys = d.keys()
    for key in keys:
        temp = prefix + ',' if prefix != '' else ''
        results = traverse_tree(d[key], temp + str(key), results)
    return results


def dict_to_df(d):
    res = traverse_tree(tree)
    labels = []
    for i in range(len(res[0]) - 1):
        labels.append('L' + str(i+1))
    labels.append('Value')
    print(res)
    print(labels)
    return pd.DataFrame.from_records(res, columns=labels)


if __name__ == '__main__':
    tree = ddict()
    tree['level1_1']['level2_1']['level3_1'] = 1
    tree['level1_1']['level2_1']['level3_2'] = 2
    tree['level1_1']['level2_2']['level3_1'] = 3
    tree['level1_1']['level2_2']['level3_2'] = 4
    tree['level1_2']['level2_1']['level3_1'] = 5
    tree['level1_2']['level2_1']['level3_2'] = 6
    tree['level1_2']['level2_2']['level3_1'] = 7
    tree['level1_2']['level2_2']['level3_2'] = 8
    df = dict_to_df(tree)
    print(df)
0

This version will work when there are various level depths, though it is messy looking.

import pandas as pd

from collections import defaultdict
def ddict():
    return defaultdict(ddict)

tree = ddict()
tree['level1_1']['level2_1']['level3_1'] = 1
tree['level1_1']['level2_1']['level3_2'] = 2
tree['level1_1']['level2_2']['level3_1'] = 3
tree['level1_1']['level2_2']['level3_2'] = 4
tree['level1_2']['level2_1']['level3_1'] = 5
tree['level1_2']['level2_1']['level3_2'] = 6
tree['level1_2']['level2_2']['level3_1'] = 7
tree['level1_2']['level2_2']['level3_2']['Level4_1'] = 8

import collections

def flatten(d, parent_key='', sep='-'):
    items = []
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        if isinstance(v, collections.MutableMapping):
            items.extend(flatten(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)

flat_dict=flatten(tree)

#df=pd.DataFrame()
levels=[]
vals=[]
for key in flat_dict.keys():
    levels.append(key.split('-'))
    vals.append(flat_dict.get(key))



max_level=0
for level in levels:
    if len(level)>max_level: max_level=len(level)

df=pd.DataFrame(columns=range(max_level+1))

index=0

for level,val in zip(levels,vals):
    for i in range(max_level):
        try: 
            level[i]
            df.loc[index,i]=level[i]
        except IndexError:
            print('means this level has less than max')

        df.loc[index,max_level]=val

    index+=1

df

Out:

          0         1         2         3  4
0  level1_1  level2_1  level3_1       NaN  1
1  level1_1  level2_1  level3_2       NaN  2
2  level1_1  level2_2  level3_1       NaN  3
3  level1_1  level2_2  level3_2       NaN  4
4  level1_2  level2_1  level3_1       NaN  5
5  level1_2  level2_1  level3_2       NaN  6
6  level1_2  level2_2  level3_1       NaN  7
7  level1_2  level2_2  level3_2  Level4_1  8

I got the flatten idea from Here

bart cubrich
  • 1,184
  • 1
  • 14
  • 41
  • I tried your method. And it works well. But, I feel the speed is a little slow. The final dataframe has 13190 rows * 7 columns. So, I think I have work on speed up this method a little bit. But, thank you very much. – Eric Aug 02 '19 at 05:00
  • You can get rid of the print statement in the `except' as well. That was just for illustrative purposes. – bart cubrich Aug 02 '19 at 17:00