
Consider one of my data sets as an example. This is the result of df.printSchema():

member: struct (nullable = true)
 |   address: struct (nullable = true)
 |    |   city: string (nullable = true)
 |    |   state: string (nullable = true)
 |    |   streetAddress: string (nullable = true)
 |    |   zipCode: string (nullable = true)
 |   birthDate: string (nullable = true)
 |   groupIdentification: string (nullable = true)
 |   memberCode: string (nullable = true)
 |   patientName: struct (nullable = true)
 |    |   first: string (nullable = true)
 |    |   last: string (nullable = true)
memberContractCode: string (nullable = true)
memberContractType: string (nullable = true)
memberProductCode: string (nullable = true)

This data is read in via JSON, and I want to flatten it out so that all fields are at the same level and my dataframe contains only primitive types, like so:

member.address.city: string (nullable = true)
member.address.state: string (nullable = true)
member.address.streetAddress: string (nullable = true)
member.address.zipCode: string (nullable = true)
member.birthDate: string (nullable = true)
member.groupIdentification: string (nullable = true)
member.memberCode: string (nullable = true)...

I know this can be done by manually specifying the column names like so:

df = df.withColumn("member.address.city", df("member.address.city")).withColumn("member.address.state", df("member.address.state"))...

However, I can't hardcode the column names like above for all of my data sets, since the program needs to process new datasets on the fly without any changes to the actual code. I want a general method that can flatten any nested structure, given that it is already in a dataframe and the schema is known (but is a subset of the full schema). Is this possible in Spark 1.6, and if so, how?

zero323
meowsephine

1 Answer


This should do it - you'll need to iterate over the schema and "flatten" it, by handling fields of type StructType separately from "simple" fields:

import org.apache.spark.sql.types.{StructField, StructType}

// helper recursive method to "flatten" the schema: descend into
// StructType fields, emit the dotted path for everything else
def getFields(parent: String, schema: StructType): Seq[String] = schema.fields.flatMap {
  case StructField(name, t: StructType, _, _) => getFields(parent + name + ".", t)
  case StructField(name, _, _, _) => Seq(s"$parent$name")
}

// apply to our DF's schema:
val fields: Seq[String] = getFields("", df.schema)

// select these fields, aliasing each to its full dotted path
// (the $ interpolator needs import sqlContext.implicits._ in Spark 1.6):
import sqlContext.implicits._
val result = df.select(fields.map(name => $"$name" as name): _*)
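To see the recursion in isolation, here is a minimal sketch of the same traversal on a mock schema using only the Scala standard library (no Spark on the classpath). `MockType`, `MockStruct`, and `flattenPaths` are made-up stand-ins for Spark's `DataType`, `StructType`, and `getFields`, just for illustration:

```scala
// Mock of Spark's schema ADT, for illustration only (not the real API)
sealed trait MockType
case object MockString extends MockType
case class MockStruct(fields: Seq[(String, MockType)]) extends MockType

// Same recursion as getFields above: descend into structs, emit dotted paths
def flattenPaths(parent: String, schema: MockStruct): Seq[String] =
  schema.fields.flatMap {
    case (name, t: MockStruct) => flattenPaths(parent + name + ".", t)
    case (name, _)             => Seq(parent + name)
  }

// A trimmed-down version of the schema from the question
val sample = MockStruct(Seq(
  "member" -> MockStruct(Seq(
    "address" -> MockStruct(Seq("city" -> MockString, "state" -> MockString)),
    "birthDate" -> MockString
  )),
  "memberContractCode" -> MockString
))

println(flattenPaths("", sample))
// List(member.address.city, member.address.state, member.birthDate, memberContractCode)
```

The leaf order matches the order in which fields appear in the schema, which is why the flattened columns come out in the same order as `printSchema()` shows them.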
Tzach Zohar
  • 37,442
  • 3
  • 79
  • 85