Flatten complex JSON schema using pyspark

Question

I am trying to flatten a complex JSON structure that contains nested arrays and struct elements, using a generic function that should work for any JSON file with any schema.

Below is part of a sample JSON structure that I want to flatten, from

root
 |-- Data: struct (nullable = true)
 |    |-- Record: struct (nullable = true)
 |    |    |-- FName: string (nullable = true)
 |    |    |-- LName: long (nullable = true)
 |    |    |-- Address: struct (nullable = true)
 |    |    |    |-- Applicant: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- Id: long (nullable = true)
 |    |    |    |    |    |-- Type: string (nullable = true)
 |    |    |    |    |    |-- Option: long (nullable = true)
 |    |    |    |-- Location: string (nullable = true)
 |    |    |    |-- Town: long (nullable = true)
 |    |    |-- IsActive: boolean (nullable = true)
 |-- Id: string (nullable = true)

to

root
 |-- Data_Record_FName: string (nullable = true)
 |-- Data_Record_LName: long (nullable = true)
 |-- Data_Record_Address_Applicant_Id: long (nullable = true)
 |-- Data_Record_Address_Applicant_Type: string (nullable = true)
 |-- Data_Record_Address_Applicant_Option: long (nullable = true)
 |-- Data_Record_Address_Location: string (nullable = true)
 |-- Data_Record_Address_Town: long (nullable = true)
 |-- Data_Record_IsActive: boolean (nullable = true)
 |-- Id: string (nullable = true)

I am using the code below, as suggested in this thread:

How to flatten a struct in a Spark dataframe?

from pyspark.sql.functions import col

def flatten_df(nested_df, layers):
    flat_cols = []
    nested_cols = []
    flat_df = []

    flat_cols.append([c[0] for c in nested_df.dtypes if c[1][:6] != 'struct'])
    nested_cols.append([c[0] for c in nested_df.dtypes if c[1][:6] == 'struct'])

    flat_df.append(nested_df.select(flat_cols[0] +
                               [col(nc+'.'+c).alias(nc+'_'+c)
                                for nc in nested_cols[0]
                                for c in nested_df.select(nc+'.*').columns])
                  )
    for i in range(1, layers):
        print (flat_cols[i-1])
        flat_cols.append([c[0] for c in flat_df[i-1].dtypes if c[1][:6] != 'struct'])
        nested_cols.append([c[0] for c in flat_df[i-1].dtypes if c[1][:6] == 'struct'])

        flat_df.append(flat_df[i-1].select(flat_cols[i] +
                                [col(nc+'.'+c).alias(nc+'_'+c)
                                    for nc in nested_cols[i]
                                    for c in flat_df[i-1].select(nc+'.*').columns])
        )

    return flat_df[-1]

my_flattened_df = flatten_df(jsonDF, 10)
my_flattened_df.printSchema()

But it doesn't work for array elements. With the above code I get the output below. Can you please help? How can I modify this code so that it also flattens arrays?

root
 |-- Data_Record_FName: string (nullable = true)
 |-- Data_Record_LName: long (nullable = true)
 |-- Data_Record_Address_Applicant: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Id: long (nullable = true)
 |    |    |-- Type: string (nullable = true)
 |    |    |-- Option: long (nullable = true)
 |-- Data_Record_Address_Location: string (nullable = true)
 |-- Data_Record_Address_Town: long (nullable = true)
 |-- Data_Record_IsActive: boolean (nullable = true)
 |-- Id: string (nullable = true)

This is not a duplicate, as there is no existing post about a generic function for flattening a complex JSON schema that also includes arrays.

use `explode` on the array, then process again with your code. — Steven, Jul 09 '19 at 14:52
Yes, I tried to modify the above as below, but it isn't working as expected. Can you help? `array_cols.append([c[0] for c in flat_df[i-1].dtypes if c[1][:5] == 'array'])` `flat_df.append(flat_df[i-1].select(flat_cols[i] + [col(nc+'.'+c).alias(nc+'_'+c) for nc in nested_cols[i] for c in flat_df[i-1].select(nc+'.*').columns for nc in array_cols[i] for c in flat_df[i-1].select(explode(nc+'.*').columns)]) )` — jakrm, Jul 09 '19 at 15:08
