I'd like to change the structure of a DataFrame in PySpark. The current schema is:
root
|-- roster_id: long (nullable = true)
|-- members: struct (nullable = true)
|    |-- m10: struct (nullable = true)
|    |    |-- name: string (nullable = true)
|    |    |-- address: string (nullable = true)
|    |    |-- hobby_1: string (nullable = true)
|    |    |-- hobby_2: string (nullable = true)
|    |-- m15: struct (nullable = true)
|    |    |-- name: string (nullable = true)
~~~~~~~
I want to convert it to:
root
|-- roster_id: long (nullable = true)
|-- member_id: string (nullable = true)
|-- name: string (nullable = true)
|-- address: string (nullable = true)
|-- hobby_1: string (nullable = true)
|-- hobby_2: string (nullable = true)
But there are two problems.
・I do not know in advance which keys appear as "members.X" (m10, m15, and so on).
・A field "members.X.X" (for example, hobby_2) may be missing, depending on the member.
I think this is difficult. Is there a way to do it?
Please also tell me if PySpark is not suitable for this.
Example
Row data:
{
  "roster_id": "abc",
  "members": {
    "m10": {
      "name": "John",
      "address": "Tokyo",
      "hobby_1": "Baseball",
      "hobby_2": "Tennis"
    },
    "m15": {
      "name": "Paul",
      "address": "NY",
      "hobby_1": "Music"
    }
  }
}
Desired output:
+---------+---------+----+-------+--------+-------+
|roster_id|member_id|name|address| hobby_1|hobby_2|
+---------+---------+----+-------+--------+-------+
|      abc|      m10|John|  Tokyo|Baseball| Tennis|
|      abc|      m15|Paul|     NY|   Music|   null|
+---------+---------+----+-------+--------+-------+
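
To make the intended mapping concrete, here is a minimal plain-Python sketch of the reshaping applied to the example record above. It is not a PySpark solution; it only spells out the rule I am after: one output row per key under "members", with any missing field filled with None. The names `flatten_roster` and `FIELDS` are hypothetical and exist only for this illustration, and the field list is assumed to match the target schema.

```python
# Plain-Python sketch of the desired reshaping (no Spark involved).
# flatten_roster and FIELDS are hypothetical names used only for illustration.

# Assumption: the output columns are the ones listed in the target schema.
FIELDS = ["name", "address", "hobby_1", "hobby_2"]


def flatten_roster(record, fields=FIELDS):
    """Return one flat dict per member; fields a member lacks become None."""
    rows = []
    # "members" is nullable in the schema, so treat a missing/None value as empty.
    members = record.get("members") or {}
    for member_id, attrs in members.items():
        row = {"roster_id": record["roster_id"], "member_id": member_id}
        for field in fields:
            # dict.get returns None when the member has no such field.
            row[field] = attrs.get(field)
        rows.append(row)
    return rows


record = {
    "roster_id": "abc",
    "members": {
        "m10": {
            "name": "John",
            "address": "Tokyo",
            "hobby_1": "Baseball",
            "hobby_2": "Tennis",
        },
        "m15": {"name": "Paul", "address": "NY", "hobby_1": "Music"},
    },
}

rows = flatten_roster(record)
for row in rows:
    print(row)
```

In PySpark, one possible direction (an untested assumption on my part) would be to read "members" as a map column rather than a struct, so that `pyspark.sql.functions.explode` can produce one row per member key without knowing the keys in advance.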