
I would like to know an efficient approach here. Let's say we have JSON data with the following schema:

root
 |-- fields: struct (nullable = true)
 |    |-- custid: string (nullable = true)
 |    |-- password: string (nullable = true)
 |    |-- role: string (nullable = true)

I can read this into a DataFrame using:

jsonData_1
  .withColumn("custid", col("fields.custid"))
  .withColumn("password", col("fields.password"))
  .withColumn("role", col("fields.role"))

But if we have hundreds of nested columns, or if the columns are prone to change over time or gain more nested columns, I feel this approach is not a good decision. Is there a way to make the code automatically look for all the columns and sub-columns and build a DataFrame by reading the input JSON file, or is this the only good approach? Please share your ideas here.

Lokesh

1 Answer


You don't need to specify each and every column from the StructType in Spark.

We can extract all of a struct's fields by selecting struct_field.* in .select.

Example:

import spark.implicits._ //needed for .toDS on a local Seq

spark.read.json(Seq("""{"fields":{"custid":"1","password":"foo","role":"rr"}}""").toDS).printSchema

//schema
//root
// |-- fields: struct (nullable = true)
// |    |-- custid: string (nullable = true)
// |    |-- password: string (nullable = true)
// |    |-- role: string (nullable = true)

//read the JSON data into a DataFrame

val df = spark.read.json(Seq("""{"fields":{"custid":"1","password":"foo","role":"rr"}}""").toDS)

//extract all field values from the fields struct
df.select("fields.*").show()

//+------+--------+----+
//|custid|password|role|
//+------+--------+----+
//|     1|     foo|  rr|
//+------+--------+----+
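
Note that .* expands only one level of nesting; an inner struct stays packed as a single struct column. A quick check of that behavior (a minimal sketch, reusing the doubly nested sample from further below):

val nested = spark.read.json(Seq("""{"fields":{"custid":"1","nested-2":{"id":"1"}}}""").toDS)

nested.select("fields.*").printSchema

//root
// |-- custid: string (nullable = true)
// |-- nested-2: struct (nullable = true)
// |    |-- id: string (nullable = true)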

A more dynamic way of flattening nested JSON:

import org.apache.spark.sql.Column
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.functions.col

def flattenSchema(schema: StructType, prefix: String = null): Array[Column] = {
  schema.fields.flatMap { f =>
    //build the full dotted path to this field
    val colName = if (prefix == null) f.name else prefix + "." + f.name

    f.dataType match {
      //recurse into nested structs
      case st: StructType => flattenSchema(st, colName)
      //leaf field: select it by its full path
      case _ => Array(col(colName))
    }
  }
}

val df = spark.read.json(Seq("""{"fields":{"custid":"1","password":"foo","role":"rr","nested-2":{"id":"1"}}}""").toDS)

df.select(flattenSchema(df.schema):_*).show()

//+------+---+--------+----+
//|custid| id|password|role|
//+------+---+--------+----+
//|     1|  1|     foo|  rr|
//+------+---+--------+----+
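
One caveat: col(colName) keeps only the leaf name, so two structs that share a leaf field name (say fields.a.id and fields.b.id) would flatten into duplicate id columns. A minimal sketch of a variant that aliases each leaf with its full path instead (the underscore separator and the name flattenSchemaUnique are my own choices, not from the original answer):

def flattenSchemaUnique(schema: StructType, prefix: String = null): Array[Column] = {
  schema.fields.flatMap { f =>
    val colName = if (prefix == null) f.name else prefix + "." + f.name
    f.dataType match {
      case st: StructType => flattenSchemaUnique(st, colName)
      //alias the leaf with its full dotted path, dots replaced by underscores
      case _ => Array(col(colName).alias(colName.replace(".", "_")))
    }
  }
}

df.select(flattenSchemaUnique(df.schema):_*).show()

//on the sample above this yields the columns fields_custid, fields_nested-2_id, fields_password, fields_role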
notNull