How to make highly nested dictionary data into data frames

Question

I've got answers in past question, but when it has multiple top-level indexes (date) and different values for each index, there's a problem. so I changed the data structure based on the advice from the previous comment, but I'm not sure if it's correct. Also, it's hard to understand how to index.

{'20211229': {'00101': ['비고101-1',
                    '비고101-2',
                    '비고101-3',
                    {0: [' UT1213K', '1', '11.00', '11', '2']}]},
 '20211230': {'00102': ['비고102-1',
                    '비고102-2',
                    '비고102-3',
                    {0: ['B 001', '2', '22.00', '44', '5']}]},
 '20211231': {'00103': ['비고103-1',
                    '비고103-2',
                    '비고103-3',
                    {0: ['B 004', '10', '66.00', '660', '66'],
                     1: ['B 005', '20', '77.00', '1540', '154'],
                     2: ['B 006', '30', '88.00', '2640', '264'],
                     3: ['B 007', '40', '99.00', '3960', '396']}],
          '00104': ['비고1',
                    '비고2',
                    '',
                    {0: ['B 003', '3', '33.00', '99', '10'],
                     1: ['B 004', '4', '44.00', '176', '18']}]}}

I want to index as below. I'd really appreciate your help.

Does this answer your question? [JSON to pandas DataFrame](https://stackoverflow.com/questions/21104592/json-to-pandas-dataframe) — user2263572, Jun 30 '21 at 15:22
I am trying to understand according to the advice of the link you gave me. Thanks for your help. — Clerami, Jun 30 '21 at 16:33
The data structure is hard to understand and parse. In fact, it is not a proper JSON structure. Do you really need to reduce the size of data by ignore the key name? — Ferris, Jul 01 '21 at 00:23
I deleted the 'key name' and unnecessary structure to apply the advice you gave me in the previous post. Maybe I didn't understand. I will change the structure and apply it according to the method you told me below. Thank you very, very much. — Clerami, Jul 01 '21 at 02:12

score 2 · Answer 1 · edited Jul 01 '21 at 02:41

IIUC, you can try:

df =pd.DataFrame({'20211229': {'00101': ['비고101-1',
                    '비고101-2',
                    '비고101-3',
                    {0: [' UT1213K', '1', '11.00', '11', '2']}]},
 '20211230': {'00102': ['비고102-1',
                    '비고102-2',
                    '비고102-3',
                    {0: ['B 001', '2', '22.00', '44', '5']}]},
 '20211231': {'00103': ['비고103-1',
                    '비고103-2',
                    '비고103-3',
                    {0: ['B 004', '10', '66.00', '660', '66'],
                     1: ['B 005', '20', '77.00', '1540', '154'],
                     2: ['B 006', '30', '88.00', '2640', '264'],
                     3: ['B 007', '40', '99.00', '3960', '396']}],
          '00104': ['비고1',
                    '비고2',
                    '',
                    {0: ['B 003', '3', '33.00', '99', '10'],
                     1: ['B 004', '4', '44.00', '176', '18']}]}})

df1 = df.stack().apply(pd.Series)
df = pd.concat([df1.drop(3, 1), df1[3].str[0].apply(pd.Series)], 1)

OUTPUT:

                      0        1        2         0   1      2    3   4
00101 20211229  비고101-1  비고101-2  비고101-3   UT1213K   1  11.00   11   2
00102 20211230  비고102-1  비고102-2  비고102-3     B 001   2  22.00   44   5
00103 20211231  비고103-1  비고103-2  비고103-3     B 004  10  66.00  660  66
00104 20211231      비고1      비고2              B 003   3  33.00   99  10

update:

# df1 = df.stack().apply(pd.Series)
obj2 = df1[3].map(lambda x: list(x.items())).explode()
df2 = obj2.str[0].to_frame()
df2.columns = ['item']
df2[['item_code', 'qty', 'supply_price', 'tax_amount', 'unit_price']] = obj2.str[1].apply(pd.Series)
pd.merge(df1.drop(3,axis=1), df2, left_index=True, right_index=True)

ouptput:

What is 'df1'? This object is not a combination of multiple data frames because the result is obtained from a repetitive statement. — Clerami, Jun 30 '21 at 15:41
It occurs 'TypeError: cannot concatenate object of type '; only Series and DataFrame objs are valid' After changing the index to string, it's the same result. Could you show me the results of the test? — Clerami, Jun 30 '21 at 16:20
The results of the method you told me show that it does not return all values in the 'raw' data. Thank you for helping me. — Clerami, Jul 01 '21 at 02:40

score 1 · Answer 2 · answered Jul 01 '21 at 01:11

If you don't mind the data size, maybe you can restructure the data.

import json
import pandas as pd

json_str = '''
[{
    "date": "20211229",
    "customer_code": "00101",
    "remark1": "비고101-1",
    "remark2": "비고101-2",
    "remark3": "비고101-3",
    "items": [{
        "item_code": "UT1213K",
        "qty": 1,
        "unit_price": 11,
        "supply_price": 11,
        "tax_amount": 2
    }]
}, {
    "date": "20211230",
    "customer_code": "00102",
    "remark1": "비고102-1",
    "remark2": "비고102-2",
    "remark3": "비고102-3",
    "items": [{
        "item_code": "B 001",
        "qty": 2,
        "unit_price": 22,
        "supply_price": 44,
        "tax_amount": 6
    }]
}, {
    "date": "20211231",
    "customer_code": "00103",
    "remark1": "비고103-1",
    "remark2": "비고103-2",
    "remark3": "비고103-3",
    "items": [{
        "item_code": "b 004",
        "qty": 10,
        "unit_price": 66,
        "supply_price": 660,
        "tax_amount": 66
    }, {
        "item_code": "b 005",
        "qty": 20,
        "unit_price": 77,
        "supply_price": 1540,
        "tax_amount": 154
    }, {
        "item_code": "b 006",
        "qty": 30,
        "unit_price": 88,
        "supply_price": 2640,
        "tax_amount": 264
    }, {
        "item_code": "b 007",
        "qty": 40,
        "unit_price": 99,
        "supply_price": 3960,
        "tax_amount": 396
    }]
}, {
    "date": "20211231",
    "customer_code": "00104",
    "remark1": "비고1",
    "remark2": "비고2",
    "remark3": "",
    "items": [{
        "item_code": "b 003",
        "qty": 3,
        "unit_price": 33,
        "supply_price": 99,
        "tax_amount": 10
    }, {
        "item_code": "b 004",
        "qty": 4,
        "unit_price": 44,
        "supply_price": 176,
        "tax_amount": 18
    }]
}]
'''

data = json.loads(json_str)
df = pd.DataFrame(data)
obj = df['items'].explode()
dfn = pd.DataFrame(obj.tolist(), index=obj.index)

cols = ['date', 'customer_code', 'remark1', 'remark2', 'remark3']
df_output = df[cols].join(dfn)
df_output['item'] = 1
df_output['item'] = df_output.groupby(cols)['item'].cumsum() - 1

(df_output.set_index(cols)
 [['item', 'item_code', 'qty', 'unit_price', 'supply_price', 'tax_amount']])

output:

score 1 · Accepted Answer · answered Jul 01 '21 at 01:56

First of all, let's solve this problem.

raw = {
    "20211229": {
        "00101": [
            "비고101-1",
            "비고101-2",
            "비고101-3",
            {0: [" UT1213K", "1", "11.00", "11", "2"]},
        ]
    },
    "20211230": {
        "00102": [
            "비고102-1",
            "비고102-2",
            "비고102-3",
            {0: ["B 001", "2", "22.00", "44", "5"]},
        ]
    },
    "20211231": {
        "00103": [
            "비고103-1",
            "비고103-2",
            "비고103-3",
            {
                0: ["B 004", "10", "66.00", "660", "66"],
                1: ["B 005", "20", "77.00", "1540", "154"],
                2: ["B 006", "30", "88.00", "2640", "264"],
                3: ["B 007", "40", "99.00", "3960", "396"],
            },
        ],
        "00104": [
            "비고1",
            "비고2",
            "",
            {
                0: ["B 003", "3", "33.00", "99", "10"],
                1: ["B 004", "4", "44.00", "176", "18"],
            },
        ],
    },
}

formatted_dict = []
for first_level_key, first_level_value in raw.items():
    for second_level_key, second_level_value in first_level_value.items():
        third_level_tmp_list = []
        for third_level_value in second_level_value:
            if isinstance(third_level_value, str):
                third_level_tmp_list.append(third_level_value)
        third_level_tmp_dict = dict(
            zip(["remark1", "remark2", "remark3"], third_level_tmp_list)
        )
        for third_level_value in second_level_value:
            if isinstance(third_level_value, dict):
                for (
                    fourth_level_key,
                    fourth_level_value,
                ) in third_level_value.items():
                    new_record = {}
                    new_record.update(
                        {
                            "date": first_level_key,
                            "customer_code": second_level_key,
                            "item": fourth_level_key,
                        }
                    )
                    new_record.update(
                        dict(
                            zip(
                                [
                                    "item_code",
                                    "qty",
                                    "unit_price",
                                    "supply_price",
                                    "tax_amount",
                                ],
                                fourth_level_value,
                            )
                        )
                    )
                    new_record.update(third_level_tmp_dict)
                    formatted_dict.append(new_record)

print(formatted_dict)
result = pd.DataFrame(formatted_dict).set_index(
    ["date", "customer_code", "remark1", "remark2", "remark3"]
)[["item", "item_code", "qty", "unit_price", "supply_price", "tax_amount"]]
pd.set_option("display.max_columns", 500)
pd.set_option("display.width", 1000)
print(result)

Second, I suggest you do not construct raw data like that. I suggest you make your raw data like this.

[
    {
        "date": "20211229",
        "customer_code": "00101",
        "item": 0,
        "item_code": " UT1213K",
        "qty": "1",
        "unit_price": "11.00",
        "supply_price": "11",
        "tax_amount": "2",
        "remark1": "비고101-1",
        "remark2": "비고101-2",
        "remark3": "비고101-3",
    },
    {
        "date": "20211230",
        "customer_code": "00102",
        "item": 0,
        "item_code": "B 001",
        "qty": "2",
        "unit_price": "22.00",
        "supply_price": "44",
        "tax_amount": "5",
        "remark1": "비고102-1",
        "remark2": "비고102-2",
        "remark3": "비고102-3",
    },
    {
        "date": "20211231",
        "customer_code": "00103",
        "item": 0,
        "item_code": "B 004",
        "qty": "10",
        "unit_price": "66.00",
        "supply_price": "660",
        "tax_amount": "66",
        "remark1": "비고103-1",
        "remark2": "비고103-2",
        "remark3": "비고103-3",
    },
    {
        "date": "20211231",
        "customer_code": "00103",
        "item": 1,
        "item_code": "B 005",
        "qty": "20",
        "unit_price": "77.00",
        "supply_price": "1540",
        "tax_amount": "154",
        "remark1": "비고103-1",
        "remark2": "비고103-2",
        "remark3": "비고103-3",
    },
    {
        "date": "20211231",
        "customer_code": "00103",
        "item": 2,
        "item_code": "B 006",
        "qty": "30",
        "unit_price": "88.00",
        "supply_price": "2640",
        "tax_amount": "264",
        "remark1": "비고103-1",
        "remark2": "비고103-2",
        "remark3": "비고103-3",
    },
    {
        "date": "20211231",
        "customer_code": "00103",
        "item": 3,
        "item_code": "B 007",
        "qty": "40",
        "unit_price": "99.00",
        "supply_price": "3960",
        "tax_amount": "396",
        "remark1": "비고103-1",
        "remark2": "비고103-2",
        "remark3": "비고103-3",
    },
    {
        "date": "20211231",
        "customer_code": "00104",
        "item": 0,
        "item_code": "B 003",
        "qty": "3",
        "unit_price": "33.00",
        "supply_price": "99",
        "tax_amount": "10",
        "remark1": "비고1",
        "remark2": "비고2",
        "remark3": "",
    },
    {
        "date": "20211231",
        "customer_code": "00104",
        "item": 1,
        "item_code": "B 004",
        "qty": "4",
        "unit_price": "44.00",
        "supply_price": "176",
        "tax_amount": "18",
        "remark1": "비고1",
        "remark2": "비고2",
        "remark3": "",
    },
]

If your raw data like this, You could just get your result simply like this.

result = pd.DataFrame(raw).set_index(
    ["date", "customer_code", "remark1", "remark2", "remark3"]
)[["item", "item_code", "qty", "unit_price", "supply_price", "tax_amount"]]

How to make highly nested dictionary data into data frames

3 Answers3

OUTPUT:

Linked