2

I have a task to transform data from csv file to nested JSON file using python for web usage. I tried the Python code in this article. The desired output would be one member_id shows one time in the JSON file, same for the tag_name under one member_id. The problem is when I tried groupby with only member_id, the tag_name 'm1' would show multiple times under 'abc123' for example. If I tried groupby with both member_id and tag_name, 'abc123' would show up two times for tag 'm1' and 'm2'. I've been googling for a while but most resolutions only for one nested(not sure if I use the right term). Please let me know if any possible way to do this.

Sample Code:

import json
import pandas as pd
df = pd.read_csv('../detail.csv', sep=',', header=0
                 , index_col=False
                 , dtype = {'member_id':str,'tag_name':str,'detail_name':str,'detail_value':str} )
group = df.groupby(['member_id','tag_name'])

finalList, finalDict = [], {}
for key, value in group:
    dictionary, dictionary1, dictList, dictList1 = {}, {}, [], []
    j = group.get_group(key).reset_index(drop=True)
    dictionary['member_id'] = j.at[0,'member_id']
    dictionary1['tag_name'] = j.at[0,'tag_name']
    
    for i in j.index:
        anotherDict = {}
        anotherDict['detail_name'] = j.at[i,'detail_name']
        anotherDict['detail_value'] = j.at[i,'detail_value']
        dictList1.append(anotherDict.copy())
        dictionary1['detail'] = dictList1 
     
    dictList.append(dictionary1)
    dictionary['tag'] = dictList
    finalList.append(dictionary)

json.dumps(finalList,ensure_ascii = False)

detail.csv:

member_id, tag_name, detail_name, detail_value
-------------------------------------------------------
abc123, m1, Service_A, 20
abc123, m1, Service_B, 20
abc123, m2, Service_C, 10
xyz456, m3, Service A, 5
xyz456, m3, Service A, 10

Desired Output JSON:

{   "member_id": "abc123",
    "tag":[ {"tag_name": "m1",
            "detail":[{ "detail_name": "Service_A",
                        "detail_value": "20"},
                    {   "detail_name": "Service_B",
                        "detail_value": "20"}]},
            {"tag_name": "m2",
            "detail":[{ "detail_name": "Service_C",
                        "detail_value": "10"}]}]},
{   "member_id": "xyz456",
    "tag":[{"tag_name": "m3",
            "detail":[{ "detail_name": "Service_A",
                        "detail_value": "5"},
                      { "detail_name": "Service_A",
                        "detail_value": "10"}]}]}
Aaron.K
  • 25
  • 4

1 Answers1

1

I'm not aware of pandas functions allowing to do achieve that directly. Moreover, you introduce keys that are not part of your initial dataframe (tag, detail). So implementing a generic solution seems difficult.

However, if you don't have more columns than stated in your question, you can go through your dataframe, grouping column after column:

result = []

for member_id, member_df in df.groupby('member_id'):
    member_dict = {'member_id': member_id}
    member_dict['tag'] = []
    for tag_name, tag_df in member_df.groupby('tag_name'):
        tag_dict = {'tag_name': tag_name}
        tag_dict['detail'] = []
        for detail_name, detail_df in tag_df.groupby('detail_name'):
            detail_dict = {'detail_name': detail_name}
            detail_dict['detail_value'] = detail_df.detail_value.mean() # should be only one value, taking 'mean' just in case
            tag_dict['detail'].append(detail_dict)
        member_dict['tag'].append(tag_dict)
    result.append(member_dict)

print(json.dumps(result, indent=4))

Output:

[
    {
        "member_id": "abc123",
        "tag": [
            {
                "tag_name": "m1",
                "detail": [
                    {
                        "detail_name": "Service_A",
                        "detail_value": 20.0
                    },
                    {
                        "detail_name": "Service_B",
                        "detail_value": 20.0
                    }
                ]
            },
            {
                "tag_name": "m2",
                "detail": [
                    {
                        "detail_name": "Service_C",
                        "detail_value": 10.0
                    }
                ]
            }
        ]
    },
    {
        "member_id": "xyz456",
        "tag": [
            {
                "tag_name": "m3",
                "detail": [
                    {
                        "detail_name": "Service A",
                        "detail_value": 5.0
                    }
                ]
            }
        ]
    }
]
Edit: if you don't want unique detail names in your lists it's shorter:
result = []

for member_id, member_df in df.groupby('member_id'):
    member_dict = {'member_id': member_id}
    member_dict['tag'] = []
    for tag_name, tag_df in member_df.groupby('tag_name'):
        tag_dict = {'tag_name': tag_name}
        tag_dict['detail'] = tag_df[['detail_name', 'detail_value']].to_dict(orient='records')
        member_dict['tag'].append(tag_dict)
    result.append(member_dict)

print(json.dumps(result, indent=4))
Tranbi
  • 11,407
  • 6
  • 16
  • 33
  • Tranbi, is there any way to improve the performance? the csv might have ten millions rows probably. – Aaron.K Dec 01 '21 at 01:31
  • I can't think of a way to greatly improve performance. But do you really need that big of a json? you could save each `member_dict` to its own json file. That would be easier on memory. – Tranbi Dec 01 '21 at 06:43
  • Understood. just one more thing, if I have a new row like 'xyz456, m3, Service A, 10', and I want them show just in different dict like Service A, Service B under (abc123, m1), how should I modify the code? – Aaron.K Dec 01 '21 at 08:21
  • What do you mean? xyz456 is already in a different dict – Tranbi Dec 01 '21 at 08:24
  • I updated the sample csv and output above, it should be clear. I don't want the detail be grouped by. so I can see all the detail even they're under same detail_name. – Aaron.K Dec 01 '21 at 08:28
  • Tranbi, I have to save each member_dict to it's own json file to see if the performance get improved. could you please guide me? thx! – Aaron.K Dec 08 '21 at 15:43
  • Sure, just use `json.dump` on `member_dict` inside your loop! – Tranbi Dec 08 '21 at 15:48