
I have a dataframe with the following schema:

root
|-- first_name: string
|-- last_name: string
|-- details: array
|    |-- element: struct
|    |    |-- university: string
|    |    |-- subjects: struct
|    |    |    |-- subject1: string
|    |    |    |-- subject2: string
|-- grades: array
|    |-- element: struct
|    |    |-- sem1: string
|    |    |-- sem2: struct

and I want to flatten it to the following schema, so that I no longer have any structs and each nested field becomes an independent array column instead.

root
|-- first_name: string
|-- last_name: string
|-- details.university: array
     |-- element: string
|-- details.subjects.subject1: array
     |-- element: string
|-- details.subjects.subject2: array
     |-- element: string
|-- grades.sem1: array
     |-- element: string
|-- grades.sem2: array
     |-- element: string

I am struggling to do the same and I'd really appreciate some help with this. Thank you!

  • Please also post a dummy dataframe so we don't have to create our own, and add the expected output. [How to make good reproducible Apache Spark examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples) will help with that. – anky Aug 18 '20 at 17:16

1 Answer

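Simply select the required columns by their dotted paths. Selecting a struct field through an array of structs returns an array of that field's values, so each selected column comes back as an array. Note that the resulting columns are named after the leaf field (`university`, `subject1`, ...) unless you alias them.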

df.select('first_name','last_name','details.university','details.subjects.subject1',
          'details.subjects.subject2','grades.sem1','grades.sem2')
Shubham Jain