Let's say I have a DataFrame as follows:
case class SubClass(id: String, size: Int, useless: String)
case class MotherClass(subClasss: Array[SubClass])
val df = sqlContext.createDataFrame(List(
  MotherClass(Array(
    SubClass("1", 1, "thisIsUseless"),
    SubClass("2", 2, "thisIsUseless"),
    SubClass("3", 3, "thisIsUseless")
  )),
  MotherClass(Array(
    SubClass("4", 4, "thisIsUseless"),
    SubClass("5", 5, "thisIsUseless")
  ))
))
The schema is:
root
|-- subClasss: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- size: integer (nullable = false)
| | |-- useless: string (nullable = true)
I'm looking for a way to select only a subset of fields, id and size, from the array column subClasss, while keeping the nested array structure.
The resulting schema would be:
root
|-- subClasss: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- size: integer (nullable = false)
I've tried:
df.select("subClasss.id","subClasss.size")
But this splits the array subClasss into two arrays:
root
|-- id: array (nullable = true)
| |-- element: string (containsNull = true)
|-- size: array (nullable = true)
| |-- element: integer (containsNull = true)
Is there a way to keep the original structure and just eliminate the useless field? Something that would look like:
df.select("subClasss.[id,size]")
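To make the intent concrete, here is a minimal sketch of the kind of per-element projection I mean. It is only an assumption about one possible direction, not a confirmed solution for the setup above: it relies on the SQL higher-order function transform, which is available from Spark 2.4 onward, and it uses a SparkSession rather than the sqlContext shown earlier. The names TrimArrayStruct and keepIdAndSize are made up for this example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

case class SubClass(id: String, size: Int, useless: String)
case class MotherClass(subClasss: Array[SubClass])

// Hypothetical helper object, not part of any Spark API.
object TrimArrayStruct {

  // Rebuilds every struct in the array with only the id and size fields.
  // Assumes Spark 2.4+ for the SQL higher-order function `transform`.
  def keepIdAndSize(df: DataFrame): DataFrame =
    df.selectExpr(
      "transform(subClasss, x -> named_struct('id', x.id, 'size', x.size)) AS subClasss"
    )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("trim-array-struct")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      MotherClass(Array(
        SubClass("1", 1, "thisIsUseless"),
        SubClass("2", 2, "thisIsUseless")
      ))
    ).toDF()

    // The array column should keep its nesting, with `useless` dropped.
    keepIdAndSize(df).printSchema()

    spark.stop()
  }
}
```

On older Spark versions, where transform is not available, this sketch would not apply, so the question still stands for those.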
Thanks for your time.