I am trying to convert 2 levels of nested JSON into a PySpark dataframe. Below is what my JSON schema looks like:
I am always getting nulls when converting to a Spark dataframe for the products struct, which is the last level of the nested JSON.
If the structure is fixed as shown in the description, then try this:
df.select(
    "b_Code",
    "b_Key",
    "r_data.s_key",
    "r_data.s_Code",
    "r_data.products.s_key",
    "r_data.products.s_Code",
    "r_data.products.s_Type",
    "r_data.products.r_type",
    "r_data.products.sl",
    "r_data.products.sp",
)
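Note that several of the selected fields share a name across levels (s_key and s_Code each appear twice), so the flattened result will contain duplicate column names. A minimal sketch of the same select with aliases to keep the names unique, assuming the schema from the question (the alias names here are just illustrative):

from pyspark.sql.functions import col

flat = df.select(
    col("b_Key"),
    col("b_Code"),
    col("r_data.s_key").alias("r_s_key"),        # alias names are illustrative
    col("r_data.s_Code").alias("r_s_Code"),
    col("r_data.products.s_key").alias("p_s_key"),
    col("r_data.products.s_Code").alias("p_s_Code"),
    col("r_data.products.s_Type").alias("p_s_Type"),
    col("r_data.products.r_type").alias("p_r_type"),
    col("r_data.products.sl").alias("p_sl"),
    col("r_data.products.sp").alias("p_sp"),
)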
Here is a function that will flatten a nested dataframe regardless of the level of nesting in the JSON:
from pyspark.sql.functions import col

def flatten_df(nested_df):
    # Each stack entry is (tuple of parent field names, dataframe projected at that path).
    stack = [((), nested_df)]
    columns = []
    while len(stack) > 0:
        parents, df = stack.pop()
        # Non-struct columns are leaves: select them by their dotted path and
        # alias them with underscores so the flat names stay readable and unique.
        flat_cols = [
            col(".".join(parents + (c[0],))).alias("_".join(parents + (c[0],)))
            for c in df.dtypes
            if c[1][:6] != "struct"
        ]
        # Struct columns need to be expanded one more level.
        nested_cols = [
            c[0]
            for c in df.dtypes
            if c[1][:6] == "struct"
        ]
        columns.extend(flat_cols)
        for nested_col in nested_cols:
            projected_df = df.select(nested_col + ".*")
            stack.append((parents + (nested_col,), projected_df))
    return nested_df.select(columns)
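A quick usage sketch, assuming df is the dataframe read from the JSON described in the question:

flat = flatten_df(df)
flat.printSchema()
# With the question's schema, the flattened columns should come out as:
# b_Key, b_Code, r_data_s_key, r_data_s_Code,
# r_data_products_s_key, r_data_products_s_Code, r_data_products_s_Type,
# r_data_products_r_type, r_data_products_sl, r_data_products_sp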
Have you tried to force the schema?
You can try this because, apparently, you have a different schema in each file, so enforcing the proper schema should solve your problem:
from pyspark.sql import types as T

schema = T.StructType(
    [
        T.StructField("b_key", T.IntegerType()),
        T.StructField("b_code", T.StringType()),
        T.StructField(
            "r_data",
            T.StructType(
                [
                    T.StructField("s_key", T.IntegerType()),
                    T.StructField("s_code", T.StringType()),
                    T.StructField(
                        "products",
                        T.StructType(
                            [
                                T.StructField("s_key", T.IntegerType()),
                                T.StructField("s_code", T.StringType()),
                                T.StructField("s_type", T.StringType()),
                                T.StructField("r_type", T.StringType()),
                                T.StructField("sl", T.DecimalType()),
                                T.StructField("sp", T.IntegerType()),
                            ]
                        ),
                    ),
                ]
            ),
        ),
    ]
)
df = spark.read.json("path/to/file.json", schema=schema)
From there, you do not have any arrays, so you can simply select
the nested columns to flatten. For example:
df.select(
    "r_data.*"
)
This will flatten the r_data struct column, and you will end up with 3 columns (s_key, s_code, and products).
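To flatten all levels in one select instead, here is a sketch, assuming the lower-case field names of the enforced schema above (the two s_key fields will again yield duplicate column names unless you alias them):

df.select(
    "b_key",
    "b_code",
    "r_data.s_key",
    "r_data.s_code",
    "r_data.products.*",  # star-expands the innermost struct
)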