My question raised when I exploited this helpful answer provided by Trenton McKinney on the issue of flattening multiple nested JSON-files for handling in pandas.
Following his advice, I have used the flatten_json
function described here to flatten a batch of nested json files. However, I have run into a problem with the uniformity of my JSON-files.
A single JSON-File looks roughly like this made-up example data:
{
"product": "example_productname",
"product_id": "example_productid",
"product_type": "example_producttype",
"producer": "example_producer",
"currency": "example_currency",
"client_id": "example_clientid",
"supplement": [
{
"supplementtype": "RTZ",
"price": 300000,
"rebate": "500",
},
{
"supplementtype": "CVB",
"price": 500000,
"rebate": "250",
},
{
"supplementtype": "JKL",
"price": 100000,
"rebate": "750",
},
],
}
Utilizing the referenced code, I will end up with data looking like this:
product | product_id | product_type | producer | currency | client_id | supplement_0_supplementtype | supplement_0_price | supplement_0_rebate | supplement_1_supplementtype | supplement_1_price | supplement_1_rebate | etc |
---|---|---|---|---|---|---|---|---|---|---|---|---|
example_productname | example_productid | example_type | example_producer | example_currency | example_clientid | RTZ | 300000 | 500 | CVB | 500000 | 250 | etc |
example_productname2 | example_productid2 | example_type2 | example_producer2 | example_currency2 | example_clientid2 | CVB | 500000 | 250 | RTZ | 300000 | 500 | etc |
There are multiple issues with this.
Firstly, in my data, there is a limited list of "supplements", however, they do not always appear, and if they do, they are not always in the same order. In the example table, you can see that the two "supplements" switched positions in the second row. I would prefer a fixed order of the "supplement columns".
Secondly, the best option would be a table like this:
product | product_id | product_type | producer | currency | client_id | supplement_RTZ_price | supplement_RTZ_rebate | supplement_CVB_price | supplement_CVB_rebate | etc |
---|---|---|---|---|---|---|---|---|---|---|
example_productname | example_productid | example_type | example_producer | example_currency | example_clientid | 300000 | 500 | 500000 | 250 | etc |
I have tried editing the flatten_json
function referenced, but I don't have an inkling of how to make this work.
The solution consists of simply editing the dictionary (thanks to Andrej Kesely). I just added a pass to exceptions in case some columns are inexistent.
d = {
"product": "example_productname",
"product_id": "example_productid",
"product_type": "example_producttype",
"producer": "example_producer",
"currency": "example_currency",
"client_id": "example_clientid",
"supplement": [
{
"supplementtype": "RTZ",
"price": 300000,
"rebate": "500",
},
{
"supplementtype": "CVB",
"price": 500000,
"rebate": "250",
},
{
"supplementtype": "JKL",
"price": 100000,
"rebate": "750",
},
],
}
for s in d["supplement"]:
try:
d["supplementtype_{}_price".format(s["supplementtype"])] = s["price"]
except:
pass
try:
d["supplementtype_{}_rebate".format(s["supplementtype"])] = s["rebate"]
except:
pass
del d["supplement"]
df = pd.DataFrame([d])
print(df)
product product_id product_type producer currency client_id supplementtype_RTZ_price supplementtype_RTZ_rebate supplementtype_CVB_price supplementtype_CVB_rebate supplementtype_JKL_price supplementtype_JKL_rebate
0 example_productname example_productid example_producttype example_producer example_currency example_clientid 300000 500 500000 250 100000 750
The used/referenced code:
def flatten_json(nested_json: dict, exclude: list=[''], sep: str='_') -> dict:
"""
Flatten a list of nested dicts.
"""
out = dict()
def flatten(x: (list, dict, str), name: str='', exclude=exclude):
if type(x) is dict:
for a in x:
if a not in exclude:
flatten(x[a], f'{name}{a}{sep}')
elif type(x) is list:
i = 0
for a in x:
flatten(a, f'{name}{i}{sep}')
i += 1
else:
out[name[:-1]] = x
flatten(nested_json)
return out
# list of files
files = ['test1.json', 'test2.json']
# list to add dataframe from each file
df_list = list()
# iterate through files
for file in files:
with open(file, 'r') as f:
# read with json
data = json.loads(f.read())
# flatten_json into a dataframe and add to the dataframe list
df_list.append(pd.DataFrame.from_dict(flatten_json(data), orient='index').T)
# concat all dataframes together
df = pd.concat(df_list).reset_index(drop=True)