
I have a requirement to flatten JSON into keys and values using PySpark/Python, so that all the nested keys go into one column and the corresponding values go into another column.

Also note that the input JSON is dynamic, so in the sample below there could be multiple subkeys and child keys. I'd appreciate any help with this.

Sample JSON input:

{
    "key1": {
        "subkey1":"1.1",
        "subkey2":"1.2"
        },
    "key2": {
        "subkey1":"2.1",
        "subkey2":"2.2",
        "subkey3": {"child3": { "subchild3":"2.3.3.3" } }
        },
    "key3": {
        "subkey1":"3.1",
        "subkey2":"3.2"
        }
}

Expected output (flattening only key2 from the nested keys):

ID  key                            value
 1  key2.subkey1                   2.1
 2  key2.subkey2                   2.2
 3  key2.subkey3.child3.subchild3  2.3.3.3
Venk AV

1 Answer


The following code provides everything you need to accomplish what you want:

data = {
    "key1": {
        "subkey1": "1.1",
        "subkey2": "1.2"
    },
    "key2": {
        "subkey1": "2.1",
        "subkey2": "2.2",
        "subkey3": {"child3": {"subchild3": "2.3.3.3"}}
    }
}
print(data)
ID      = 0
lstRows = []

def getTableRow(data, key):
    # Recursively walk the nested dict, appending one row per leaf value.
    global ID
    for k, v in data.items():
        full = key + '.' + k if key else k   # dotted key path
        if isinstance(v, dict):
            getTableRow(v, full)
        else:
            ID += 1
            lstRows.append({"ID": ID, "key": full, "value": v})

getTableRow(data, '')
print( lstRows )
dctTable = {"ID":[],"key":[], "value":[]}
for dct in lstRows:
    dctTable["ID"].append(dct["ID"])
    dctTable["key"].append(dct["key"])
    dctTable["value"].append(dct["value"])
print( dctTable )

import pandas as pd
df = pd.DataFrame.from_dict(dctTable)
# df = pd.DataFrame(lstRows)  # equivalent to above .from_dict()
# df = pd.DataFrame(dctTable) # equivalent to above .from_dict()
print(df)

prints

{'key1': {'subkey1': '1.1', 'subkey2': '1.2'}, 'key2': {'subkey1': '2.1', 'subkey2': '2.2', 'subkey3': {'child3': {'subchild3': '2.3.3.3'}}}}
[{'ID': 1, 'key': 'key1.subkey1', 'value': '1.1'}, {'ID': 2, 'key': 'key1.subkey2', 'value': '1.2'}, {'ID': 3, 'key': 'key2.subkey1', 'value': '2.1'}, {'ID': 4, 'key': 'key2.subkey2', 'value': '2.2'}, {'ID': 5, 'key': 'key2.subkey3.child3.subchild3', 'value': '2.3.3.3'}]
{'ID': [1, 2, 3, 4, 5], 'key': ['key1.subkey1', 'key1.subkey2', 'key2.subkey1', 'key2.subkey2', 'key2.subkey3.child3.subchild3'], 'value': ['1.1', '1.2', '2.1', '2.2', '2.3.3.3']}
   ID                            key    value
0   1                   key1.subkey1      1.1
1   2                   key1.subkey2      1.2
2   3                   key2.subkey1      2.1
3   4                   key2.subkey2      2.2
4   5  key2.subkey3.child3.subchild3  2.3.3.3

It uses a recursive function to create the rows of the resulting table.

As I don't use PySpark, the table shown was created using Pandas.
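If you do want a Spark DataFrame, a Pandas DataFrame can typically be handed over via `SparkSession.createDataFrame`. A minimal sketch, assuming a `SparkSession` named `spark` already exists (e.g. in a PySpark shell or notebook); the Spark call is shown commented out so the snippet also runs without Spark installed:

```python
import pandas as pd

# Column-oriented dictionary, as produced by the code above
dctTable = {"ID": [1, 2], "key": ["key1.subkey1", "key1.subkey2"],
            "value": ["1.1", "1.2"]}
df = pd.DataFrame(dctTable)

# With a SparkSession available, convert the Pandas DataFrame:
# sdf = spark.createDataFrame(df)  # assumes `spark` is a SparkSession
# sdf.show()
```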

See also "Flatten nested dictionaries, compressing keys" for a general and flexible way of flattening a nested dictionary that also handles list values.
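Along the lines of that linked approach, here is a minimal sketch of my own that also descends into lists, appending each element's index to the key path (this `flatten` helper is illustrative, not the linked answer's exact code):

```python
def flatten(obj, parent_key=''):
    """Flatten nested dicts and lists into a single {dotted.path: leaf} dict."""
    items = {}
    if isinstance(obj, dict):
        for k, v in obj.items():
            new_key = parent_key + '.' + k if parent_key else k
            items.update(flatten(v, new_key))
    elif isinstance(obj, list):
        # list elements get their index as a path segment
        for i, v in enumerate(obj):
            items.update(flatten(v, parent_key + '.' + str(i)))
    else:
        items[parent_key] = obj
    return items

print(flatten({"a": {"b": ["x", "y"]}}))
# {'a.b.0': 'x', 'a.b.1': 'y'}
```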


See the code below for further instructions and the explanations requested in the comments:

# ======================================================================
# You can read the json file content directly into dct_data using the
# Python json.load(fp) function ( fp = open(filename) ).

# This code starts with the json file content stored in str_data:
str_data = """
{
    "key1": {
        "subkey1": "1.1",
        "subkey2": "1.2"
    },
    "key2": {
        "subkey1": "2.1",
        "subkey2": "2.2",
        "subkey3": {"child3": {"subchild3": "2.3.3.3"}}
    }
}"""
# print(str_data)

# Let's create a Python dictionary from the json data string:  
import json
dct_data = json.loads(str_data) # or = json.load(open(filename))
print(dct_data)

# Here the function for flattening the dictionary dct_data returning 
# a dictionary with flattened dct_data content: 
def flattenNestedDictionary(dct_data, key='', lstRows=None):
    if lstRows is None:  # avoid the mutable-default-argument pitfall
        lstRows = []
    for k, v in dct_data.items():
        new_key = key + '.' + k if key else k
        if isinstance(v, dict):
            flattenNestedDictionary(v, new_key, lstRows)
        else:
            # sequential ID derived from the number of rows collected so far
            lstRows.append({"ID": len(lstRows) + 1, "key": new_key,
                            "value": v})

    # Only the top-level call (key == '') builds and returns the
    # flattened, column-oriented dictionary:
    if key == '':
        print('lstRows:', lstRows)
        dct_flattened_json = {"ID": [], "key": [], "value": []}
        for dct in lstRows:
            dct_flattened_json["ID"].append(dct["ID"])
            dct_flattened_json["key"].append(dct["key"])
            dct_flattened_json["value"].append(dct["value"])
        print('#', dct_flattened_json)
        return dct_flattened_json

dct_flattened_json = flattenNestedDictionary(dct_data)

# Let's create a valid json data string out of the dictionary: 
str_flattened_json = json.dumps(dct_flattened_json)
print('>', str_flattened_json)

# you can now write the str_flattened_json string to a file and load 
# the new json file with flattened data into spark DataFrame. 
# Or you load the string  str_flattened_json  into a spark DataFrame.  
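The question's expected output flattens only key2. A minimal, self-contained sketch of restricting the walk to one top-level key: pass just that subtree into the recursion and seed the key prefix with its name (`walk` here is a hypothetical stand-in for the functions above, and the `in` check covers dynamic input where the key may not exist):

```python
def walk(data, prefix, rows):
    """Append one {ID, key, value} row per leaf under the given prefix."""
    for k, v in data.items():
        path = prefix + '.' + k if prefix else k
        if isinstance(v, dict):
            walk(v, path, rows)
        else:
            rows.append({"ID": len(rows) + 1, "key": path, "value": v})

data = {
    "key1": {"subkey1": "1.1", "subkey2": "1.2"},
    "key2": {"subkey1": "2.1", "subkey2": "2.2",
             "subkey3": {"child3": {"subchild3": "2.3.3.3"}}},
}

rows = []
if "key2" in data:            # the key may or may not exist in dynamic input
    walk(data["key2"], "key2", rows)
print(rows)
```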
Claudio
  • The above code works if the input is provided within the code, but not when I read it from a file. I am using `spark.read.option("multiline","true").json("filename")`. Error: 'DataFrame' object has no attribute 'item'; also tried 'iterrows' and it errored again – Venk AV Feb 03 '23 at 08:31
  • If you don't provide the json file to start with and instead pass a Spark DataFrame to code expecting a Python dictionary, no wonder you get errors. Get a dictionary from the json file matching the sample json input you posted in your question to avoid errors. Or post an actual example of a json file if you don't know how to get a json file's content into a Python dictionary. – Claudio Feb 03 '23 at 08:55
  • Thanks for your assistance, but one more question about the example provided above: how do I get data related to key2 only instead of key1? If I try `for k, v in data['key1'].items():` it works, but it does not work for key2 – Venk AV Feb 03 '23 at 10:12
  • What about deleting `data['key1']` before calling the `getTableRow` function? `key2` values need the recursive function call within the function to handle the nested data, whereas `key1` values are not further nested. (P.S.: to be honest, I don't understand what you are asking for and what does not work for key2?) – Claudio Feb 03 '23 at 10:59
  • See my updated answer for additional explanation and generally the same code as above, rewritten to make it easier to understand how to use it. – Claudio Feb 03 '23 at 12:47
  • Appreciate your help. I have edited my question with further details on the requirement: the goal is to pick a specific nested key and its value pairs only. I can delete the nested keys which are not required, but the actual input data has multiple nested keys and is dynamic (they may or may not exist), so an option to pick specific keys would be the right approach. – Venk AV Feb 06 '23 at 06:42
  • Hopefully you don't expect me to react to changes in the requirements of your question again. If my answer was/is helpful you can mark it as accepted. And if it is obsolete because you have changed some details in the requirements and don't consider your question answered ... that's life ... not my business. – Claudio Feb 06 '23 at 07:09