Consider one of my data sets as an example. Below is the output of df.printSchema():
member: struct (nullable = true)
| address: struct (nullable = true)
| | city: string (nullable = true)
| | state: string (nullable = true)
| | streetAddress: string (nullable = true)
| | zipCode: string (nullable = true)
| birthDate: string (nullable = true)
| groupIdentification: string (nullable = true)
| memberCode: string (nullable = true)
| patientName: struct (nullable = true)
| | first: string (nullable = true)
| | last: string (nullable = true)
memberContractCode: string (nullable = true)
memberContractType: string (nullable = true)
memberProductCode: string (nullable = true)
This data is read in via Spark's JSON reader (a rough sketch of the reader call follows the listing below), and I want to flatten it out so that every field sits at the top level and the DataFrame contains only primitive-typed columns, like so:
member.address.city: string (nullable = true)
member.address.state: string (nullable = true)
member.address.streetAddress: string (nullable = true)
member.address.zipCode: string (nullable = true)
member.birthDate: string (nullable = true)
member.groupIdentification: string (nullable = true)
member.memberCode: string (nullable = true)...
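For context, the DataFrame comes straight from Spark's JSON reader, roughly like this (the path is a placeholder and sc is my existing SparkContext):

import org.apache.spark.sql.SQLContext

// Hypothetical input path; in Spark 1.6 the JSON reader infers the nested struct schema shown above.
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.json("path/to/members.json")
df.printSchema()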
I know this can be done by manually specifying the column names like so:
df = df.withColumn("member.address.city", df("member.address.city")).withColumn("member.address.state", df("member.address.state"))...
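Written out fully with select and aliases (rather than chained withColumn calls), the manual version for this particular schema would look roughly like the sketch below, with every column name hardcoded, which is exactly what I want to avoid:

import org.apache.spark.sql.functions.col

// Manual flattening: every nested field enumerated by hand and aliased to its dotted path.
val flatDf = df.select(
  col("member.address.city").alias("member.address.city"),
  col("member.address.state").alias("member.address.state"),
  col("member.address.streetAddress").alias("member.address.streetAddress"),
  col("member.address.zipCode").alias("member.address.zipCode"),
  col("member.birthDate").alias("member.birthDate"),
  col("member.groupIdentification").alias("member.groupIdentification"),
  col("member.memberCode").alias("member.memberCode"),
  col("member.patientName.first").alias("member.patientName.first"),
  col("member.patientName.last").alias("member.patientName.last"),
  col("memberContractCode"),
  col("memberContractType"),
  col("memberProductCode")
)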
However, I can't hardcode the column names like this for every data set, because the program needs to handle new data sets on the fly without any changes to the code. I want a general method that can flatten any nested structure, given that the data is already in a DataFrame and its schema is known (but is a subset of the full schema). Is this possible in Spark 1.6, and if so, how?
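To make the intent concrete, here is a rough, untested sketch of the kind of generic method I have in mind: it recursively walks df.schema and selects every leaf field under an aliased dotted path (flattenSchema is just my own name for the hypothetical helper, and array columns are ignored):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Recursively walk the schema, producing one aliased Column per leaf field, so a nested
// field such as member.address.city becomes a flat column named "member.address.city".
// Array columns are not handled here.
def flattenSchema(schema: StructType, prefix: String = null): Array[Column] = {
  schema.fields.flatMap { field =>
    val name = if (prefix == null) field.name else s"$prefix.${field.name}"
    field.dataType match {
      case struct: StructType => flattenSchema(struct, name)
      case _                  => Array(col(name).alias(name))
    }
  }
}

val flattenedDf = df.select(flattenSchema(df.schema): _*)

The idea is that the select list is generated from df.schema itself, so any nesting depth and any new data set would be handled without code changes; whether col with dotted paths and alias behaves this way on arbitrarily nested structs in 1.6 is part of what I am asking.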