
Consider one of my data sets as an example. This is the result of df.printSchema():

member: struct (nullable = true)
 |   address: struct (nullable = true)
 |    |   city: string (nullable = true)
 |    |   state: string (nullable = true)
 |    |   streetAddress: string (nullable = true)
 |    |   zipCode: string (nullable = true)
 |   birthDate: string (nullable = true)
 |   groupIdentification: string (nullable = true)
 |   memberCode: string (nullable = true)
 |   patientName: struct (nullable = true)
 |    |   first: string (nullable = true)
 |    |   last: string (nullable = true)
memberContractCode: string (nullable = true)
memberContractType: string (nullable = true)
memberProductCode: string (nullable = true)

This data is read in via JSON, and I want to flatten it out so that all fields are at the same level and my dataframe contains only primitive types, like so:

member.address.city: string (nullable = true)
member.address.state: string (nullable = true)
member.address.streetAddress: string (nullable = true)
member.address.zipCode: string (nullable = true)
member.birthDate: string (nullable = true)
member.groupIdentification: string (nullable = true)
member.memberCode: string (nullable = true)...

I know this can be done by manually specifying the column names like so:

df = df.withColumn("member.address.city", df("member.address.city")).withColumn("member.address.state", df("member.address.state"))...

However, I can't hardcode the column names like above for all of my data sets, since the program needs to process new datasets on the fly without any changes to the actual code. I want a general method that can flatten any nested structure, given that it is already in a dataframe and the schema is known (but is a subset of the full schema). Is this possible in Spark 1.6, and if so, how?

zero323
meowsephine

1 Answer


This should do it - you'll need to iterate over the schema and "flatten" it, by handling fields of type StructType separately from "simple" fields:

import org.apache.spark.sql.types.{StructField, StructType}

// helper recursive method to "flatten" the schema: descend into
// StructType fields, emit the dotted path for everything else
def getFields(parent: String, schema: StructType): Seq[String] = schema.fields.flatMap {
  case StructField(name, t: StructType, _, _) => getFields(parent + name + ".", t)
  case StructField(name, _, _, _) => Seq(s"$parent$name")
}

// apply to our DF's schema:
val fields: Seq[String] = getFields("", df.schema)

// select these fields, aliasing each to its full dotted path
// (the $ interpolator needs import sqlContext.implicits._ in Spark 1.6):
import sqlContext.implicits._
val result = df.select(fields.map(name => $"$name" as name): _*)
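To see the recursion in isolation, here is a minimal sketch of the same traversal on a mock schema using only the Scala standard library (no Spark on the classpath). `MockType`, `MockStruct`, and `flattenPaths` are made-up stand-ins for Spark's `DataType`, `StructType`, and `getFields`, just for illustration:

```scala
// Mock of Spark's schema ADT, for illustration only (not the real API)
sealed trait MockType
case object MockString extends MockType
case class MockStruct(fields: Seq[(String, MockType)]) extends MockType

// Same recursion as getFields above: descend into structs, emit dotted paths
def flattenPaths(parent: String, schema: MockStruct): Seq[String] =
  schema.fields.flatMap {
    case (name, t: MockStruct) => flattenPaths(parent + name + ".", t)
    case (name, _)             => Seq(parent + name)
  }

// A trimmed-down version of the schema from the question
val sample = MockStruct(Seq(
  "member" -> MockStruct(Seq(
    "address" -> MockStruct(Seq("city" -> MockString, "state" -> MockString)),
    "birthDate" -> MockString
  )),
  "memberContractCode" -> MockString
))

println(flattenPaths("", sample))
// List(member.address.city, member.address.state, member.birthDate, memberContractCode)
```

The leaf order matches the order in which fields appear in the schema, which is why the flattened columns come out in the same order as `printSchema()` shows them.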
Tzach Zohar
  • 37,442
  • 3
  • 79
  • 85