
I have a requirement to flatten JSON into keys and values using PySpark/Python, so that all the nested keys go into one column and the corresponding values go into another column.

Also note that the input JSON is dynamic, so in the sample below there could be multiple subkeys and child keys. I'd appreciate any help with this.

Sample JSON input:

{
    "key1": {
        "subkey1":"1.1",
        "subkey2":"1.2"
        },
    "key2": {
        "subkey1":"2.1",
        "subkey2":"2.2",
        "subkey3": {"child3": { "subchild3":"2.3.3.3" } }
        },
    "key3": {
        "subkey1":"3.1",
        "subkey2":"3.2"
        }
}

Expected output (flattening only key2 from the nested keys):

ID  key                            value
 1  key2.subkey1                   2.1
 2  key2.subkey2                   2.2
 3  key2.subkey3.child3.subchild3  2.3.3.3
Venk AV

1 Answer


The following code provides everything you need to accomplish what you want:

data = {
    "key1": {
        "subkey1": "1.1",
        "subkey2": "1.2"
    },
    "key2": {
        "subkey1": "2.1",
        "subkey2": "2.2",
        "subkey3": {"child3": {"subchild3": "2.3.3.3"}}
    }
}
print(data)
ID      = 0
lstRows = []

def getTableRow(data, key):
    # Recursively walk the nested dict, appending one row per leaf value.
    global ID
    for k, v in data.items():
        full = key + '.' + k if key else k   # dotted key path
        if isinstance(v, dict):
            getTableRow(v, full)
        else:
            ID += 1
            lstRows.append({"ID": ID, "key": full, "value": v})

getTableRow(data, '')
print( lstRows )
dctTable = {"ID":[],"key":[], "value":[]}
for dct in lstRows:
    dctTable["ID"].append(dct["ID"])
    dctTable["key"].append(dct["key"])
    dctTable["value"].append(dct["value"])
print( dctTable )

import pandas as pd
df = pd.DataFrame.from_dict(dctTable)
# df = pd.DataFrame(lstRows)  # equivalent to above .from_dict()
# df = pd.DataFrame(dctTable) # equivalent to above .from_dict()
print(df)

prints

{'key1': {'subkey1': '1.1', 'subkey2': '1.2'}, 'key2': {'subkey1': '2.1', 'subkey2': '2.2', 'subkey3': {'child3': {'subchild3': '2.3.3.3'}}}}
[{'ID': 1, 'key': 'key1.subkey1', 'value': '1.1'}, {'ID': 2, 'key': 'key1.subkey2', 'value': '1.2'}, {'ID': 3, 'key': 'key2.subkey1', 'value': '2.1'}, {'ID': 4, 'key': 'key2.subkey2', 'value': '2.2'}, {'ID': 5, 'key': 'key2.subkey3.child3.subchild3', 'value': '2.3.3.3'}]
{'ID': [1, 2, 3, 4, 5], 'key': ['key1.subkey1', 'key1.subkey2', 'key2.subkey1', 'key2.subkey2', 'key2.subkey3.child3.subchild3'], 'value': ['1.1', '1.2', '2.1', '2.2', '2.3.3.3']}
   ID                            key    value
0   1                   key1.subkey1      1.1
1   2                   key1.subkey2      1.2
2   3                   key2.subkey1      2.1
3   4                   key2.subkey2      2.2
4   5  key2.subkey3.child3.subchild3  2.3.3.3

It uses a recursive function to create the rows of the resulting table.

As I don't use PySpark, the table shown was created using Pandas.
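If you do want a Spark DataFrame, a Pandas DataFrame can typically be handed over via `SparkSession.createDataFrame`. A minimal sketch, assuming a `SparkSession` named `spark` already exists (e.g. in a PySpark shell or notebook); the Spark call is shown commented out so the snippet also runs without Spark installed:

```python
import pandas as pd

# Column-oriented dictionary, as produced by the code above
dctTable = {"ID": [1, 2], "key": ["key1.subkey1", "key1.subkey2"],
            "value": ["1.1", "1.2"]}
df = pd.DataFrame(dctTable)

# With a SparkSession available, convert the Pandas DataFrame:
# sdf = spark.createDataFrame(df)  # assumes `spark` is a SparkSession
# sdf.show()
```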

See also "Flatten nested dictionaries, compressing keys" for a general and flexible way of flattening a nested dictionary that also handles list values.
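Along the lines of that linked approach, here is a minimal sketch of my own that also descends into lists, appending each element's index to the key path (this `flatten` helper is illustrative, not the linked answer's exact code):

```python
def flatten(obj, parent_key=''):
    """Flatten nested dicts and lists into a single {dotted.path: leaf} dict."""
    items = {}
    if isinstance(obj, dict):
        for k, v in obj.items():
            new_key = parent_key + '.' + k if parent_key else k
            items.update(flatten(v, new_key))
    elif isinstance(obj, list):
        # list elements get their index as a path segment
        for i, v in enumerate(obj):
            items.update(flatten(v, parent_key + '.' + str(i)))
    else:
        items[parent_key] = obj
    return items

print(flatten({"a": {"b": ["x", "y"]}}))
# {'a.b.0': 'x', 'a.b.1': 'y'}
```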


See the code below for further instructions and the explanations requested in the comments:

# ======================================================================
# You can read the json file content directly into dct_data using the
# Python json.load(fp) function ( fp = open(filename) ).

# This code starts with the json file content stored in str_data:
str_data = """
{
    "key1": {
        "subkey1": "1.1",
        "subkey2": "1.2"
    },
    "key2": {
        "subkey1": "2.1",
        "subkey2": "2.2",
        "subkey3": {"child3": {"subchild3": "2.3.3.3"}}
    }
}"""
# print(str_data)

# Let's create a Python dictionary from the json data string:  
import json
dct_data = json.loads(str_data) # or = json.load(open(filename))
print(dct_data)

# Here the function for flattening the dictionary dct_data returning 
# a dictionary with flattened dct_data content: 
def flattenNestedDictionary(dct_data, key='', lstRows=None):
    if lstRows is None:  # avoid the mutable-default-argument pitfall
        lstRows = []
    for k, v in dct_data.items():
        new_key = key + '.' + k if key else k
        if isinstance(v, dict):
            flattenNestedDictionary(v, new_key, lstRows)
        else:
            # sequential ID derived from the number of rows collected so far
            lstRows.append({"ID": len(lstRows) + 1, "key": new_key,
                            "value": v})

    # Only the top-level call (key == '') builds and returns the
    # flattened, column-oriented dictionary:
    if key == '':
        print('lstRows:', lstRows)
        dct_flattened_json = {"ID": [], "key": [], "value": []}
        for dct in lstRows:
            dct_flattened_json["ID"].append(dct["ID"])
            dct_flattened_json["key"].append(dct["key"])
            dct_flattened_json["value"].append(dct["value"])
        print('#', dct_flattened_json)
        return dct_flattened_json

dct_flattened_json = flattenNestedDictionary(dct_data)

# Let's create a valid json data string out of the dictionary: 
str_flattened_json = json.dumps(dct_flattened_json)
print('>', str_flattened_json)

# you can now write the str_flattened_json string to a file and load 
# the new json file with flattened data into spark DataFrame. 
# Or you load the string  str_flattened_json  into a spark DataFrame.  
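The question's expected output flattens only key2. A minimal, self-contained sketch of restricting the walk to one top-level key: pass just that subtree into the recursion and seed the key prefix with its name (`walk` here is a hypothetical stand-in for the functions above, and the `in` check covers dynamic input where the key may not exist):

```python
def walk(data, prefix, rows):
    """Append one {ID, key, value} row per leaf under the given prefix."""
    for k, v in data.items():
        path = prefix + '.' + k if prefix else k
        if isinstance(v, dict):
            walk(v, path, rows)
        else:
            rows.append({"ID": len(rows) + 1, "key": path, "value": v})

data = {
    "key1": {"subkey1": "1.1", "subkey2": "1.2"},
    "key2": {"subkey1": "2.1", "subkey2": "2.2",
             "subkey3": {"child3": {"subchild3": "2.3.3.3"}}},
}

rows = []
if "key2" in data:            # the key may or may not exist in dynamic input
    walk(data["key2"], "key2", rows)
print(rows)
```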
Claudio
  • The above code works if the input is provided within the code, but not when I read it from a file. I am using `spark.read.option("multiline","true").json("filename")`. Error: 'DataFrame' object has no attribute 'item'; also tried 'iterrows' and it errored again – Venk AV Feb 03 '23 at 08:31
  • If you don't provide the json file to start with and instead pass a Spark DataFrame to code expecting a Python dictionary, no wonder you get errors. Get a dictionary from the json file matching the sample json input you posted in your question to avoid errors. Or post an actual example of a json file if you don't know how to get a json file's content into a Python dictionary. – Claudio Feb 03 '23 at 08:55
  • Thanks for your assistance, but one more question about the example provided above: how do I get data related to key2 only instead of key1? If I try `for k, v in data['key1'].items():` it works, but it does not work for key2 – Venk AV Feb 03 '23 at 10:12
  • What about deleting `data['key1']` before calling the `getTableRow` function? `key2` values need the recursive function call within the function to handle the nested data, whereas `key1` values are not further nested. (P.S.: to be honest, I don't understand what you are asking for and what does not work for key2?) – Claudio Feb 03 '23 at 10:59
  • See my updated answer for additional explanation and generally the same code as above, rewritten to make it easier to understand how to use it. – Claudio Feb 03 '23 at 12:47
  • Appreciate your help. I have edited my question with further details on the requirement: the goal is to pick a specific nested key and its value pairs only. I can delete the nested keys which are not required, but the actual input data has multiple nested keys and is dynamic (they may or may not exist), so an option to pick specific keys would be the right approach. – Venk AV Feb 06 '23 at 06:42
  • Hopefully you don't expect me to react to changes in the requirements of your question again. If my answer was/is helpful you can mark it as accepted. And if it is obsolete because you have changed some details in the requirements and don't consider your question answered ... that's life ... not my business. – Claudio Feb 06 '23 at 07:09