PySpark flatten dataframe having some columns as array of nested structs

Question

I have a dataframe with the following schema:

root
|-- first_name: string
|-- last_name: string
|-- details: array
|    |-- element: struct
|    |    |-- university: string
|    |    |-- subjects: struct
|    |    |    |-- subject1: string
|    |    |    |-- subject2: string
|-- grades: array
|    |-- element: struct
|    |    |-- sem1: string
|    |    |-- sem2: struct

I want to flatten it to the following schema, so that I no longer have any structs and each nested field becomes an independent array column instead.

root
|-- first_name: string
|-- last_name: string
|-- details.university: array
|    |-- element: string
|-- details.subjects.subject1: array
|    |-- element: string
|-- details.subjects.subject2: array
|    |-- element: string
|-- grades.sem1: array
|    |-- element: string
|-- grades.sem2: array
|    |-- element: string

I am struggling to do the same and I'd really appreciate some help with this. Thank you!

Please also post a dummy dataframe so we don't have to create our own, and add the expected output. [How to make good reproducible Apache Spark examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples) will help with that. – anky, Aug 18 '20 at 17:16

score 0 · Answer 1 · answered Aug 19 '20 at 05:36


Simply select the required columns. Selecting a nested field through an array of structs returns an array of that field's values:

df.select('first_name','last_name','details.university','details.subjects.subject1',
          'details.subjects.subject2','grades.sem1','grades.sem2')
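If the schema is larger, the column list above could be generated instead of typed by hand. The following is only a minimal sketch: `leaf_paths` and the small schema-building helpers are hypothetical names, the input is assumed to be the dictionary returned by `df.schema.jsonValue()`, and `sem2` is assumed to be a string (as the desired output suggests). It looks through a single array level only, since dotted selection does not reach through arrays nested inside arrays.

```python
def leaf_paths(schema_json, prefix="", inside_array=False):
    """Return dotted paths to the leaf fields of a Spark schema dict.

    schema_json is assumed to look like df.schema.jsonValue():
    {"type": "struct", "fields": [{"name": ..., "type": ...}, ...]}
    """
    paths = []
    for field in schema_json["fields"]:
        name = prefix + field["name"]
        ftype = field["type"]
        in_array = inside_array
        # Look through one array level so array<struct<...>> is walked like a struct.
        if isinstance(ftype, dict) and ftype["type"] == "array" and not in_array:
            ftype = ftype["elementType"]
            in_array = True
        if isinstance(ftype, dict) and ftype["type"] == "struct":
            paths.extend(leaf_paths(ftype, name + ".", in_array))
        else:
            # Primitive types, maps, and arrays nested in arrays are kept as leaves.
            paths.append(name)
    return paths


# Hypothetical helpers that build a schema dict by hand, mirroring the question.
def field(name, ftype="string"):
    return {"name": name, "type": ftype, "nullable": True, "metadata": {}}

def struct(*fields):
    return {"type": "struct", "fields": list(fields)}

def array_of(element):
    return {"type": "array", "elementType": element, "containsNull": True}


schema = struct(
    field("first_name"),
    field("last_name"),
    field("details", array_of(struct(
        field("university"),
        field("subjects", struct(field("subject1"), field("subject2"))),
    ))),
    field("grades", array_of(struct(field("sem1"), field("sem2")))),
)

paths = leaf_paths(schema)
print(paths)
```

With a real dataframe, the paths would come from `leaf_paths(df.schema.jsonValue())` and could be passed to `df.select(*paths)`. Note that a plain select names each result column after the last field only (`university`, `subject1`, ...), so an alias is needed to keep the full path in the column name; a dotted alias then has to be quoted with backticks whenever it is referenced later.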


Shubham Jain


The datatype here is an array of structs, not just a struct, so I don't think this will work. – nishant26900 Aug 19 '20 at 06:01
Try it, sir, and it will create the array. – Shubham Jain Aug 19 '20 at 06:34
