Flattening nested JSON to pandas.DataFrame: Ordering and Naming Columns based on dictionary values

Question

My question raised when I exploited this helpful answer provided by Trenton McKinney on the issue of flattening multiple nested JSON-files for handling in pandas.
Following his advice, I have used the flatten_json function described here to flatten a batch of nested json files. However, I have run into a problem with the uniformity of my JSON-files.

A single JSON-File looks roughly like this made-up example data:

{
    "product": "example_productname",
    "product_id": "example_productid",
    "product_type": "example_producttype",
    "producer": "example_producer",
    "currency": "example_currency",
    "client_id": "example_clientid",
    "supplement": [
        {
            "supplementtype": "RTZ",
            "price": 300000,
            "rebate": "500",
        },
        {
            "supplementtype": "CVB",
            "price": 500000,
            "rebate": "250",
        },
        {
            "supplementtype": "JKL",
            "price": 100000,
            "rebate": "750",
        },
    ],
}

Utilizing the referenced code, I will end up with data looking like this:

product	product_id	product_type	producer	currency	client_id	supplement_0_supplementtype	supplement_0_price	supplement_0_rebate	supplement_1_supplementtype	supplement_1_price	supplement_1_rebate	etc
example_productname	example_productid	example_type	example_producer	example_currency	example_clientid	RTZ	300000	500	CVB	500000	250	etc
example_productname2	example_productid2	example_type2	example_producer2	example_currency2	example_clientid2	CVB	500000	250	RTZ	300000	500	etc

There are multiple issues with this.
Firstly, in my data, there is a limited list of "supplements", however, they do not always appear, and if they do, they are not always in the same order. In the example table, you can see that the two "supplements" switched positions in the second row. I would prefer a fixed order of the "supplement columns".

Secondly, the best option would be a table like this:

product	product_id	product_type	producer	currency	client_id	supplement_RTZ_price	supplement_RTZ_rebate	supplement_CVB_price	supplement_CVB_rebate	etc
example_productname	example_productid	example_type	example_producer	example_currency	example_clientid	300000	500	500000	250	etc

I have tried editing the flatten_json function referenced, but I don't have an inkling of how to make this work.
The solution consists of simply editing the dictionary (thanks to Andrej Kesely). I just added a pass to exceptions in case some columns are inexistent.

d = {
    "product": "example_productname",
    "product_id": "example_productid",
    "product_type": "example_producttype",
    "producer": "example_producer",
    "currency": "example_currency",
    "client_id": "example_clientid",
    "supplement": [
        {
            "supplementtype": "RTZ",
            "price": 300000,
            "rebate": "500",
        },
        {
            "supplementtype": "CVB",
            "price": 500000,
            "rebate": "250",
        },
        {
            "supplementtype": "JKL",
            "price": 100000,
            "rebate": "750",
        },
    ],
}

for s in d["supplement"]:
    try:
        d["supplementtype_{}_price".format(s["supplementtype"])] = s["price"]
    except:
        pass
    try:
        d["supplementtype_{}_rebate".format(s["supplementtype"])] = s["rebate"]
    except:
        pass

del d["supplement"]

df = pd.DataFrame([d])
print(df)
               product         product_id         product_type          producer          currency         client_id  supplementtype_RTZ_price supplementtype_RTZ_rebate  supplementtype_CVB_price supplementtype_CVB_rebate  supplementtype_JKL_price supplementtype_JKL_rebate
0  example_productname  example_productid  example_producttype  example_producer  example_currency  example_clientid                    300000                       500                    500000                       250                    100000                       750

The used/referenced code:

def flatten_json(nested_json: dict, exclude: list=[''], sep: str='_') -> dict:
    """
    Flatten a list of nested dicts.
    """
    out = dict()
    def flatten(x: (list, dict, str), name: str='', exclude=exclude):
        if type(x) is dict:
            for a in x:
                if a not in exclude:
                    flatten(x[a], f'{name}{a}{sep}')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, f'{name}{i}{sep}')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(nested_json)
    return out

# list of files
files = ['test1.json', 'test2.json']

# list to add dataframe from each file
df_list = list()

# iterate through files
for file in files:
    with open(file, 'r') as f:

        # read with json
        data = json.loads(f.read())

        # flatten_json into a dataframe and add to the dataframe list
        df_list.append(pd.DataFrame.from_dict(flatten_json(data), orient='index').T)
        
# concat all dataframes together
df = pd.concat(df_list).reset_index(drop=True)

Thanks for the help with making the tables work, Tomerikoo. Mind sharing why they rendered in the preview, but not in the actual post? :) — Kaschmir, Apr 09 '21 at 15:52

score 1 · Accepted Answer · answered Apr 09 '21 at 18:33

You can modify the dictionary before you create dataframe from it:

d = {
    "product": "example_productname",
    "product_id": "example_productid",
    "product_type": "example_producttype",
    "producer": "example_producer",
    "currency": "example_currency",
    "client_id": "example_clientid",
    "supplement": [
        {
            "supplementtype": "RTZ",
            "price": 300000,
            "rebate": "500",
        },
        {
            "supplementtype": "CVB",
            "price": 500000,
            "rebate": "250",
        },
        {
            "supplementtype": "JKL",
            "price": 100000,
            "rebate": "750",
        },
    ],
}

for s in d["supplement"]:
    d["supplementtype_{}_price".format(s["supplementtype"])] = s["price"]
    d["supplementtype_{}_rebate".format(s["supplementtype"])] = s["rebate"]

del d["supplement"]

df = pd.DataFrame([d])
print(df)

Prints:

               product         product_id         product_type          producer          currency         client_id  supplementtype_RTZ_price supplementtype_RTZ_rebate  supplementtype_CVB_price supplementtype_CVB_rebate  supplementtype_JKL_price supplementtype_JKL_rebate
0  example_productname  example_productid  example_producttype  example_producer  example_currency  example_clientid                    300000                       500                    500000                       250                    100000                       750

Thank you! This was an easy solution. I have slightly adjusted it to run even when one of the "supplement"-columns is inexistent, and updated my question with it. I guess I have a lot to learn about dictionaries/python data management. — Kaschmir, Apr 12 '21 at 14:54

Flattening nested JSON to pandas.DataFrame: Ordering and Naming Columns based on dictionary values

1 Answers1