0

I'm currently struggling with creating a dataframe based on a dictionary that is nested like {key1:[{key:value},{key:value}, ...],key2:[{key:value},{key:value},...]} And I want this to go into a dataframe, where the value of key1 and key2 are the index, while the list nested key:value pairs would become the column and record values.

Now, for each key1, key2, etc the list key:value pairs can be different in size. Example data:

some_dict = {'0000297386FB11E2A2730050568F1BAB': [{'FILE_ID': '0000297386FB11E2A2730050568F1BAB'},
  {'FileTime': '1362642335'},
  {'Size': '1016439'},
  {'DocType_Code': 'AF3BD580734A77068DD083389AD7FDAF'},
  {'Filenr': 'F682B798EC9481FF031C4C12865AEB9A'},
  {'DateRegistered': 'FAC4F7F9C3217645C518D5AE473DCB1E'},
  {'TITLE': '2096158F036B0F8ACF6F766A9B61A58B'}],
 '000031EA51DA11E397D30050568F1BAB': [{'FILE_ID': '000031EA51DA11E397D30050568F1BAB'},
  {'FileTime': '1384948248'},
  {'Size': '873514'},
  {'DatePosted': '7C6BCB90AC45DA1ED6D1C376FC300E7B'},
  {'DocType_Code': '28F404E9F3C394518AF2FD6A043D3A81'},
  {'Filenr': '13A6A062672A88DE75C4D35917F3C415'},
  {'DateRegistered': '8DD4262899F20DE45F09F22B3107B026'},
  {'Comment': 'AE207D73C9DDB76E1EEAA9241VJGN02'},
  {'TITLE': 'DF96336A6FE08E34C5A94F6A828B4B62'}]}

The final result should look like this:

Index | File_ID | ... | DatePosted | ... | Comment | Title
0000297386FB11E2A2730050568F1BAB|0000297386FB11E2A2730050568F1BAB|...|NaN|...|NaN|2096158F036B0F8ACF6F766A9B61A58B
000031EA51DA11E397D30050568F1BAB|000031EA51DA11E397D30050568F1BAB|...|7C6BCB90AC45DA1ED6D1C376FC300E7B|...|AE207D73C9DDB76E1EEAA9241VJGN02|DF96336A6FE08E34C5A94F6A828B4B62

Now I've tried to parse the dict directly to pandas using comprehension as suggested in Creating dataframe from a dictionary where entries have different lengths and tried to flatten the dict more, and then parsing it to pandas Flatten nested dictionaries, compressing keys. Both with no avail.

Wokkel
  • 319
  • 4
  • 12
  • have you tried 'from_dict'? https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.from_dict.html – kalzso Dec 13 '22 at 12:51
  • Yes, and that results only into a long dataframe, with `[key:value]` pairs as direct input for the records. Resulting in `{'Size': '873514'}` as a row under column X. Only the `key1, key2, keyN` values are transposed as the index. – Wokkel Dec 13 '22 at 12:57

2 Answers2

2

Here you go.

You do not need key of first dict. Because it's also available in lower stages. Then you need to merge multiple dicts into single one. I did that with update. THen we turn dict into pd series. And concat it into a dataframe.

In [39]: seriess = []
    ...: for values in some_dict.values():
    ...:     d = {}
    ...:     for thing in values:
    ...:         d.update(thing)
    ...:     s = pd.Series(d)
    ...:     seriess.append(s)
    ...:

In [40]: pd.concat(seriess,axis=1).T
Out[40]:
                            FILE_ID    FileTime     Size  ...                             TITLE                        DatePosted                          Comment
0  0000297386FB11E2A2730050568F1BAB  1362642335  1016439  ...  2096158F036B0F8ACF6F766A9B61A58B                               NaN                              NaN
1  000031EA51DA11E397D30050568F1BAB  1384948248   873514  ...  DF96336A6FE08E34C5A94F6A828B4B62  7C6BCB90AC45DA1ED6D1C376FC300E7B  AE207D73C9DDB76E1EEAA9241VJGN02
Aidis
  • 1,272
  • 4
  • 14
  • 31
0

Let's try the following code:

dfs = []
for k in some_dict.keys():
    dfs.append(pd.DataFrame.from_records(some_dict[k]))
    
new_df = [dfs[0].append(x) for x in dfs[1:]][0]

final_result = (new_df
                .groupby(new_df['FILE_ID'].notna().cumsum())
                .first())

Output

    FILE_ID FileTime    Size    DocType_Code    Filenr  DateRegistered  TITLE   DatePosted  Comment
FILE_ID                                 
1   0000297386FB11E2A2730050568F1BAB    1362642335  1016439 AF3BD580734A77068DD083389AD7FDAF    F682B798EC9481FF031C4C12865AEB9A    FAC4F7F9C3217645C518D5AE473DCB1E    2096158F036B0F8ACF6F766A9B61A58B    None    None
2   000031EA51DA11E397D30050568F1BAB    1384948248  873514  28F404E9F3C394518AF2FD6A043D3A81    13A6A062672A88DE75C4D35917F3C415    8DD4262899F20DE45F09F22B3107B026    DF96336A6FE08E34C5A94F6A828B4B62    7C6BCB90AC45DA1ED6D1C376FC300E7B    AE207D73C9DDB76E1EEAA9241VJGN02
Khaled DELLAL
  • 871
  • 4
  • 16