Pandas dataframe groups to nested dict

Question

I have a dataframe (below) that I need to turn into a nested dict by PERSON_ID, then YEAR. I've been able to do it this way;

frame = frame.T.to_dict('dict')

But unfortunately I need it structured like this;

{76129: {1951: 'IN': 3.77551684175021, 'OUT': 6.02818626979883,
         1952: 'IN': 3.67945267132245, 'OUT': 1.7685974058508,
         1953: 'IN': 3.53030183426851, 'OUT': 0.409577500579766}}

If anyone has ideas on how to approach this I would appreciate it.

	PERSON_ID	YEAR	IN	OUT
0	76129	1951	3.77551684175021	6.02818626979883
1	76224	1951	9.3791597299824	9.53608578598666
2	76250	1951	0.729347478193212	5.74296130666972
3	76322	1951	0.922030969294425	8.95933733613574
4	76129	1952	3.67945267132245	1.7685974058508
5	76224	1952	2.43404429471111	7.97540821827656
6	76250	1952	7.26162056498856	9.76505935514356
7	76322	1952	8.66970822529531	7.50026191441197
8	76129	1953	3.53030183426851	0.409577500579766
9	76224	1953	3.45390554224515	3.20774562896629
10	76250	1953	6.63976713572943	5.48027529875715
11	76322	1953	7.87048287939222	0.610433799575476

If you're only trying to export JSON, see the existing JSON questions. Do you actually really need this nested dict internally? Seems unlikely. If so can you explain why? — smci, Jan 11 '22 at 01:05
Your desired output is not a valid nested dictionary (`invalid syntax`). Can you confirm what the desired output should be? — Bill, Jan 11 '22 at 01:10
Nick: **do you only want to export JSON or not? If yes, this is a duplicate of [Convert Pandas DataFrame to JSON format](https://stackoverflow.com/questions/39257147/convert-pandas-dataframe-to-json-format)**. (pandas itself also has a [`df.to_json()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_json.html) but it's not flexible enough to create the nested dict/grouping you want.) — smci, Jan 11 '22 at 01:16
@smci even with the trailing '}}' that has now been added this is still not a valid dict. When I try it I get `SyntaxError: invalid syntax`. — Bill, Jan 11 '22 at 01:22
@Bill: then OP is missing the `[...]` brackets, within each record. That's why I keep asking "is this just trying to export json"?? Because if yes, that's reinventing the wheel, and should be closed as duplicate. — smci, Jan 11 '22 at 01:29
@smci sorry had to step away for a bit but no I'm not trying to export it to json. I suppose it doesn't have to be in that format but I thought the structure made the most sense. This isn't exactly familiar territory for me. — Nick, Jan 11 '22 at 02:42
But **why do you want a nested dict, internally in pandas?** It's generally a pain to deal with, hence a bad idea. More pandas-friendly to have a `groupby('PERSON_ID')`, then `groupby('YEAR')`. Or you could use pandas `merge/join`. What are you ultimately trying to compute/display/tabulate? You need to supply more context. — smci, Jan 11 '22 at 02:45
@smci Well let me ask you this because. If I have a PERSON ID and I need to pull a lot of information from alot of different sources, would it be more efficient pull that data from a dict or dataframe? — Nick, Jan 11 '22 at 02:59
If PERSON_ID is your primary identifier across a lot of distinct dataframes/ SQL tables/ CSV files, the pandas(/SQL) idiom is to read them in and do a `merge/join` on PERSON_ID field. But like I keep saying, you have to *show us the wider context of what you're ultimately tryng to do*. What are those other datasources: activity logs? personal details? census record? mailing addresses? movie reviews? (Not just insist on creating some arbitrary data-structure.) — smci, Jan 11 '22 at 03:04
@smci The endgame is a data structure I can access quickly to pull any data on a person. The data is basketball players. General info (birthdate etc) and year specific info (weight, etc). I can only build the structure if I pull the data from other sources and each player has an ID for 1 site and a different ID for another. So I first have to match the ID's based on similarities in their names and birthdates. Then the player information can be pulled. I was doing something similar to what you suggested with merging but it got really messy so I thought I would try a different approach. — Nick, Jan 11 '22 at 03:15
@smci I thought about asking a general question about how to approach the entire project but I wasn't sure if that's even allowed. — Nick, Jan 11 '22 at 03:17
Nick: well, just add some more detail above the current thing, as context to this specific code question. (It's not like you're asking people to write your entire project.) That's ok. — smci, Jan 11 '22 at 04:17

score 2 · Answer 1 · answered Jan 11 '22 at 01:05

2

we ned to first set the index , the with groupby to_dict in order to create the multiple level dict

d = df.set_index('YEAR').groupby('PERSON_ID').apply(lambda x : x.drop('PERSON_ID',axis = 1).to_dict('index')).to_dict()

Sample output

d[76129]
{1951: {'IN': 3.77551684175021, 'OUT': 6.02818626979883}, 1952: {'IN': 3.67945267132245, 'OUT': 1.7685974058508}, 1953: {'IN': 3.53030183426851, 'OUT': 0.409577500579766}}

answered Jan 11 '22 at 01:05

BENY

317,841
20
164
234

If the OP only wants to export JSON, this is a duplicate of many existing questions. – smci Jan 11 '22 at 01:18

Bill · Accepted Answer · 2022-01-11T01:27:48.450

Assuming you actually want a nested dictionary like this (note the extra braces):

{76129: {1951: {'IN': 3.77551684175021, 'OUT': 6.02818626979883},
         1952: {'IN': 3.67945267132245, 'OUT': 1.7685974058508},
         1953: {'IN': 3.53030183426851, 'OUT': 0.409577500579766}},
 ... etc.
}

Here is a step-by-step approach.

First, create a dataframe with the desired (PERSON_ID, YEAR) multi-index:

frame_sorted = frame.set_index(['PERSON_ID', 'YEAR']).sort_index()
print(frame_sorted)

Output:

                      IN       OUT
PERSON_ID YEAR                    
76129     1951  3.775517  6.028186
          1952  3.679453  1.768597
          1953  3.530302  0.409578
... etc.

Then, created the nested dict using a nested dictionary comprehension:

person_ids = frame_sorted.index.levels[0]
data_dict = {person: {idx: data.to_dict() for idx, data in frame_sorted.loc[person].iterrows()}
             for person in person_ids}
print(data_dict)

Output

{76129: {1951: {'IN': 3.77551684175021, 'OUT': 6.02818626979883},
  1952: {'IN': 3.67945267132245, 'OUT': 1.7685974058508},
  1953: {'IN': 3.53030183426851, 'OUT': 0.409577500579766}},
 ...etc.

Pandas dataframe groups to nested dict

2 Answers2