0

I am not sure how to word this properly, so I will try to explain this with a replicable example.

I have thousands of entries in a pandas.DataFrame object. I want to export each row as its own json file with a few keys that are not explicitly available in the data frame's structure.

My data frame, df, looks as follows:

> df = pd.DataFrame({'ID':['0','1','2'],
                     'NAME':['jimmy','james','joben'],
                     'LAST':['neutron','baxter','segel'],
                     'ADDRESS':['101 ocean avenue','202 bubble gum county','303 china town'],
                     'COUNTY':['yellow card park','candy kingdom','universal studio']})
ID NAME LAST ADDRESS COUNTY
0 jimmy neutron 101 ocean avenue yellow card park
1 james baxter 202 bubble gum county candy kingdom
2 joben segel 303 china town universal studio

I want to transform each row to json files with the following structure:

  • 'ID'
  • 'PERSONAL', which encloses NAME and LAST
  • 'GEO', which encloses ADDRESS and COUNTY

In other words, I would like the first transformed row to look as follows:

{
    'ID':'0',
    'PERSONAL':
        {
            'NAME':'jimmy',
            'LAST':'neutron'
        },
    'GEO':
        {
            'ADDRESS':'101 ocean avenue',
            'COUNTY':'yellow card park'
        }
}

I was thinking of converting df into a pandas.MultiIndex structure to add the PERSONAL and GEO indices and then export each row with pandas.DataFrame.to_json. However, I'm not sure if this will work and wanted to ask around prior to trying this.

Arturo Sbr
  • 5,567
  • 4
  • 38
  • 76

2 Answers2

2

Pandas is equipped for this out of the box.

pandas.DataFrame.to_json

here is the example dataframe:

import json
df = pd.DataFrame(
    [["a", "b"], ["c", "d"]],
    index=["row 1", "row 2"],
    columns=["col 1", "col 2"],
)

Here is the result using to_json():

result = df.to_json(orient="split")
parsed = json.loads(result)
json.dumps(parsed, indent=4)  
{
    "columns": [
        "col 1",
        "col 2"
    ],
    "index": [
        "row 1",
        "row 2"
    ],
    "data": [
        [
            "a",
            "b"
        ],
        [
            "c",
            "d"
        ]
    ]
}

here is the link: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_json.html

D.L
  • 4,339
  • 5
  • 22
  • 45
  • I upvoted this because I appreciate the effort. However, I am confused as to how to apply `to_json` to my problem. Your example has the column names as its own key, rather than setting new keys (PERSONAL and GEO) and assigning certain columns to each one. Can you expand on this please? – Arturo Sbr May 26 '21 at 20:18
  • it seems from the next answer below that it has been done for you. However, you would just replace the column names `columns=` with your own columns headings and the data in the table with your own data. `df.columns = ['ID', 'PERSONAL.NAME', 'PERSONAL.LAST', 'GEO.ADDRESS', 'GEO.COUNTY']` – D.L May 27 '21 at 20:34
1

As per the function provided here @Parsa T. You can just change the column names and use the function to get the required result.

def set_for_keys(my_dict, key_arr, val):
    """
    Set value at the path in my_dict defined by the string (or serializable object) array key_arr
    """
    current = my_dict
    for i in range(len(key_arr)):
        key = key_arr[i]
        if key not in current:
            current[key] = val if i==len(key_arr)-1 else {}
        else:
            if type(current[key]) is not dict:
                print("Given dictionary is not compatible with key structure requested")
                raise ValueError("Dictionary key already occupied")

        current = current[key]

    return my_dict

def to_formatted_json(df, sep="."):
    result = []
    for _, row in df.iterrows():
        parsed_row = {}
        for idx, val in row.iteritems():
            keys = idx.split(sep)
            parsed_row = set_for_keys(parsed_row, keys, val)

        result.append(parsed_row)
    return result


df.columns = ['ID', 'PERSONAL.NAME', 'PERSONAL.LAST', 'GEO.ADDRESS', 'GEO.COUNTY']
#Where df was parsed from json-dict using json_normalize
print(to_formatted_json(df, sep="."))

OUTPUT:

[{'ID': '0',
  'PERSONAL': {'NAME': 'jimmy', 'LAST': 'neutron'},
  'GEO': {'ADDRESS': '101 ocean avenue', 'COUNTY': 'yellow card park'}},
 {'ID': '1',
  'PERSONAL': {'NAME': 'james', 'LAST': 'baxter'},
  'GEO': {'ADDRESS': '202 bubble gum county', 'COUNTY': 'candy kingdom'}},
 {'ID': '2',
  'PERSONAL': {'NAME': 'joben', 'LAST': 'segel'},
  'GEO': {'ADDRESS': '303 china town', 'COUNTY': 'universal studio'}}]
Nk03
  • 14,699
  • 2
  • 8
  • 22