1

I have a dataframe (below) that I need to turn into a nested dict by PERSON_ID, then YEAR. I've been able to do it this way;

frame = frame.T.to_dict('dict')

But unfortunately I need it structured like this;

{76129: {1951: 'IN': 3.77551684175021, 'OUT': 6.02818626979883,
         1952: 'IN': 3.67945267132245, 'OUT': 1.7685974058508,
         1953: 'IN': 3.53030183426851, 'OUT': 0.409577500579766}}

If anyone has ideas on how to approach this I would appreciate it.

PERSON_ID YEAR IN OUT
0 76129 1951 3.77551684175021 6.02818626979883
1 76224 1951 9.3791597299824 9.53608578598666
2 76250 1951 0.729347478193212 5.74296130666972
3 76322 1951 0.922030969294425 8.95933733613574
4 76129 1952 3.67945267132245 1.7685974058508
5 76224 1952 2.43404429471111 7.97540821827656
6 76250 1952 7.26162056498856 9.76505935514356
7 76322 1952 8.66970822529531 7.50026191441197
8 76129 1953 3.53030183426851 0.409577500579766
9 76224 1953 3.45390554224515 3.20774562896629
10 76250 1953 6.63976713572943 5.48027529875715
11 76322 1953 7.87048287939222 0.610433799575476
smci
  • 32,567
  • 20
  • 113
  • 146
Nick
  • 367
  • 4
  • 16
  • If you're only trying to export JSON, see the existing JSON questions. Do you actually really need this nested dict internally? Seems unlikely. If so can you explain why? – smci Jan 11 '22 at 01:05
  • Your desired output is not a valid nested dictionary (`invalid syntax`). Can you confirm what the desired output should be? – Bill Jan 11 '22 at 01:10
  • @Bill: missing trailing '}}' – smci Jan 11 '22 at 01:13
  • Nick: **do you only want to export JSON or not? If yes, this is a duplicate of [Convert Pandas DataFrame to JSON format](https://stackoverflow.com/questions/39257147/convert-pandas-dataframe-to-json-format)**. (pandas itself also has a [`df.to_json()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_json.html) but it's not flexible enough to create the nested dict/grouping you want.) – smci Jan 11 '22 at 01:16
  • @smci even with the trailing '}}' that has now been added this is still not a valid dict. When I try it I get `SyntaxError: invalid syntax`. – Bill Jan 11 '22 at 01:22
  • @Bill: then OP is missing the `[...]` brackets, within each record. That's why I keep asking "is this just trying to export json"?? Because if yes, that's reinventing the wheel, and should be closed as duplicate. – smci Jan 11 '22 at 01:29
  • @smci sorry had to step away for a bit but no I'm not trying to export it to json. I suppose it doesn't have to be in that format but I thought the structure made the most sense. This isn't exactly familiar territory for me. – Nick Jan 11 '22 at 02:42
  • But **why do you want a nested dict, internally in pandas?** It's generally a pain to deal with, hence a bad idea. More pandas-friendly to have a `groupby('PERSON_ID')`, then `groupby('YEAR')`. Or you could use pandas `merge/join`. What are you ultimately trying to compute/display/tabulate? You need to supply more context. – smci Jan 11 '22 at 02:45
  • @smci Well let me ask you this because. If I have a PERSON ID and I need to pull a lot of information from alot of different sources, would it be more efficient pull that data from a dict or dataframe? – Nick Jan 11 '22 at 02:59
  • If PERSON_ID is your primary identifier across a lot of distinct dataframes/ SQL tables/ CSV files, the pandas(/SQL) idiom is to read them in and do a `merge/join` on PERSON_ID field. But like I keep saying, you have to *show us the wider context of what you're ultimately tryng to do*. What are those other datasources: activity logs? personal details? census record? mailing addresses? movie reviews? (Not just insist on creating some arbitrary data-structure.) – smci Jan 11 '22 at 03:04
  • @smci The endgame is a data structure I can access quickly to pull any data on a person. The data is basketball players. General info (birthdate etc) and year specific info (weight, etc). I can only build the structure if I pull the data from other sources and each player has an ID for 1 site and a different ID for another. So I first have to match the ID's based on similarities in their names and birthdates. Then the player information can be pulled. I was doing something similar to what you suggested with merging but it got really messy so I thought I would try a different approach. – Nick Jan 11 '22 at 03:15
  • @smci I thought about asking a general question about how to approach the entire project but I wasn't sure if that's even allowed. – Nick Jan 11 '22 at 03:17
  • Nick: well, just add some more detail above the current thing, as context to this specific code question. (It's not like you're asking people to write your entire project.) That's ok. – smci Jan 11 '22 at 04:17

2 Answers2

2

we ned to first set the index , the with groupby to_dict in order to create the multiple level dict

d = df.set_index('YEAR').groupby('PERSON_ID').apply(lambda x : x.drop('PERSON_ID',axis = 1).to_dict('index')).to_dict()

Sample output

d[76129]
{1951: {'IN': 3.77551684175021, 'OUT': 6.02818626979883}, 1952: {'IN': 3.67945267132245, 'OUT': 1.7685974058508}, 1953: {'IN': 3.53030183426851, 'OUT': 0.409577500579766}}
BENY
  • 317,841
  • 20
  • 164
  • 234
  • If the OP only wants to export JSON, this is a duplicate of many existing questions. – smci Jan 11 '22 at 01:18
2

Assuming you actually want a nested dictionary like this (note the extra braces):

{76129: {1951: {'IN': 3.77551684175021, 'OUT': 6.02818626979883},
         1952: {'IN': 3.67945267132245, 'OUT': 1.7685974058508},
         1953: {'IN': 3.53030183426851, 'OUT': 0.409577500579766}},
 ... etc.
}

Here is a step-by-step approach.

First, create a dataframe with the desired (PERSON_ID, YEAR) multi-index:

frame_sorted = frame.set_index(['PERSON_ID', 'YEAR']).sort_index()
print(frame_sorted)

Output:

                      IN       OUT
PERSON_ID YEAR                    
76129     1951  3.775517  6.028186
          1952  3.679453  1.768597
          1953  3.530302  0.409578
... etc.

Then, created the nested dict using a nested dictionary comprehension:

person_ids = frame_sorted.index.levels[0]
data_dict = {person: {idx: data.to_dict() for idx, data in frame_sorted.loc[person].iterrows()}
             for person in person_ids}
print(data_dict)

Output

{76129: {1951: {'IN': 3.77551684175021, 'OUT': 6.02818626979883},
  1952: {'IN': 3.67945267132245, 'OUT': 1.7685974058508},
  1953: {'IN': 3.53030183426851, 'OUT': 0.409577500579766}},
 ...etc.
Bill
  • 10,323
  • 10
  • 62
  • 85