1

My question raised when I exploited this helpful answer provided by Trenton McKinney on the issue of flattening multiple nested JSON-files for handling in pandas.
Following his advice, I have used the flatten_json function described here to flatten a batch of nested json files. However, I have run into a problem with the uniformity of my JSON-files.

A single JSON-File looks roughly like this made-up example data:

{
    "product": "example_productname",
    "product_id": "example_productid",
    "product_type": "example_producttype",
    "producer": "example_producer",
    "currency": "example_currency",
    "client_id": "example_clientid",
    "supplement": [
        {
            "supplementtype": "RTZ",
            "price": 300000,
            "rebate": "500",
        },
        {
            "supplementtype": "CVB",
            "price": 500000,
            "rebate": "250",
        },
        {
            "supplementtype": "JKL",
            "price": 100000,
            "rebate": "750",
        },
    ],
}

Utilizing the referenced code, I will end up with data looking like this:

product product_id product_type producer currency client_id supplement_0_supplementtype supplement_0_price supplement_0_rebate supplement_1_supplementtype supplement_1_price supplement_1_rebate etc
example_productname example_productid example_type example_producer example_currency example_clientid RTZ 300000 500 CVB 500000 250 etc
example_productname2 example_productid2 example_type2 example_producer2 example_currency2 example_clientid2 CVB 500000 250 RTZ 300000 500 etc

There are multiple issues with this.
Firstly, in my data, there is a limited list of "supplements", however, they do not always appear, and if they do, they are not always in the same order. In the example table, you can see that the two "supplements" switched positions in the second row. I would prefer a fixed order of the "supplement columns".

Secondly, the best option would be a table like this:

product product_id product_type producer currency client_id supplement_RTZ_price supplement_RTZ_rebate supplement_CVB_price supplement_CVB_rebate etc
example_productname example_productid example_type example_producer example_currency example_clientid 300000 500 500000 250 etc

I have tried editing the flatten_json function referenced, but I don't have an inkling of how to make this work.
The solution consists of simply editing the dictionary (thanks to Andrej Kesely). I just added a pass to exceptions in case some columns are inexistent.

d = {
    "product": "example_productname",
    "product_id": "example_productid",
    "product_type": "example_producttype",
    "producer": "example_producer",
    "currency": "example_currency",
    "client_id": "example_clientid",
    "supplement": [
        {
            "supplementtype": "RTZ",
            "price": 300000,
            "rebate": "500",
        },
        {
            "supplementtype": "CVB",
            "price": 500000,
            "rebate": "250",
        },
        {
            "supplementtype": "JKL",
            "price": 100000,
            "rebate": "750",
        },
    ],
}

for s in d["supplement"]:
    try:
        d["supplementtype_{}_price".format(s["supplementtype"])] = s["price"]
    except:
        pass
    try:
        d["supplementtype_{}_rebate".format(s["supplementtype"])] = s["rebate"]
    except:
        pass

del d["supplement"]

df = pd.DataFrame([d])
print(df)
               product         product_id         product_type          producer          currency         client_id  supplementtype_RTZ_price supplementtype_RTZ_rebate  supplementtype_CVB_price supplementtype_CVB_rebate  supplementtype_JKL_price supplementtype_JKL_rebate
0  example_productname  example_productid  example_producttype  example_producer  example_currency  example_clientid                    300000                       500                    500000                       250                    100000                       750

The used/referenced code:

def flatten_json(nested_json: dict, exclude: list=[''], sep: str='_') -> dict:
    """
    Flatten a list of nested dicts.
    """
    out = dict()
    def flatten(x: (list, dict, str), name: str='', exclude=exclude):
        if type(x) is dict:
            for a in x:
                if a not in exclude:
                    flatten(x[a], f'{name}{a}{sep}')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, f'{name}{i}{sep}')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(nested_json)
    return out

# list of files
files = ['test1.json', 'test2.json']

# list to add dataframe from each file
df_list = list()

# iterate through files
for file in files:
    with open(file, 'r') as f:

        # read with json
        data = json.loads(f.read())

        # flatten_json into a dataframe and add to the dataframe list
        df_list.append(pd.DataFrame.from_dict(flatten_json(data), orient='index').T)
        
# concat all dataframes together
df = pd.concat(df_list).reset_index(drop=True)
Kaschmir
  • 69
  • 7
  • Thanks for the help with making the tables work, Tomerikoo. Mind sharing why they rendered in the preview, but not in the actual post? :) – Kaschmir Apr 09 '21 at 15:52

1 Answers1

1

You can modify the dictionary before you create dataframe from it:

d = {
    "product": "example_productname",
    "product_id": "example_productid",
    "product_type": "example_producttype",
    "producer": "example_producer",
    "currency": "example_currency",
    "client_id": "example_clientid",
    "supplement": [
        {
            "supplementtype": "RTZ",
            "price": 300000,
            "rebate": "500",
        },
        {
            "supplementtype": "CVB",
            "price": 500000,
            "rebate": "250",
        },
        {
            "supplementtype": "JKL",
            "price": 100000,
            "rebate": "750",
        },
    ],
}

for s in d["supplement"]:
    d["supplementtype_{}_price".format(s["supplementtype"])] = s["price"]
    d["supplementtype_{}_rebate".format(s["supplementtype"])] = s["rebate"]

del d["supplement"]

df = pd.DataFrame([d])
print(df)

Prints:

               product         product_id         product_type          producer          currency         client_id  supplementtype_RTZ_price supplementtype_RTZ_rebate  supplementtype_CVB_price supplementtype_CVB_rebate  supplementtype_JKL_price supplementtype_JKL_rebate
0  example_productname  example_productid  example_producttype  example_producer  example_currency  example_clientid                    300000                       500                    500000                       250                    100000                       750
Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91
  • 1
    Thank you! This was an easy solution. I have slightly adjusted it to run even when one of the "supplement"-columns is inexistent, and updated my question with it. I guess I have a lot to learn about dictionaries/python data management. – Kaschmir Apr 12 '21 at 14:54