I have searched the net for a couple of hours now for any hint on how to explode a nested struct in an Apache Spark DataFrame using PySpark. For array and map types there is the function explode() in pyspark.sql.functions, but for a struct type no such function is available (it would be a nice feature, though).

This is my problem: I have a DataFrame that looks like the following:
newJSON = '{"level_instance_json":{"events":{"0":{"id":1,"visible":true},"1":{"id":2,"visible":true},"2":{"id":1,"visible":false},"3":{"id":2,"visible":false}}},"user_id":"a1"}'
newJSON2 = '{"level_instance_json":{"events":{"0":{"id":1,"visible":true},"1":{"id":2,"visible":true},"2":{"id":1,"visible":false},"3":{"id":2,"visible":false}}},"user_id":"b2"}'
dfJSON = spark.read.json( sc.parallelize( [newJSON, newJSON2] ) )
dfJSON.printSchema()
You can clearly see the nested structure in the level_instance_json struct. What I want is to explode all of these nested structures, so that for each numbered struct (i.e. 0, 1, 2, 3) there is a new row (numbered by 0, 1, 2, 3) and all my variables end up on the same level:
event_id | id | visible | user_id
0        | 1  | true    | a1
1        | 2  | true    | a1
2        | 1  | false   | a1
3        | 2  | false   | a1
0        | 1  | true    | b2
1        | 2  | true    | b2
2        | 1  | false   | b2
3        | 2  | false   | b2
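To make the target row shape concrete, here is the same flattening done in plain Python over the parsed JSON strings from above. This is just a sketch of the rows I am after, not a Spark solution:

```python
import json

newJSON = '{"level_instance_json":{"events":{"0":{"id":1,"visible":true},"1":{"id":2,"visible":true},"2":{"id":1,"visible":false},"3":{"id":2,"visible":false}}},"user_id":"a1"}'
newJSON2 = '{"level_instance_json":{"events":{"0":{"id":1,"visible":true},"1":{"id":2,"visible":true},"2":{"id":1,"visible":false},"3":{"id":2,"visible":false}}},"user_id":"b2"}'

rows = []
for raw in (newJSON, newJSON2):
    record = json.loads(raw)
    user_id = record["user_id"]
    # one output row per numbered event struct
    for event_id, event in sorted(record["level_instance_json"]["events"].items()):
        rows.append((event_id, event["id"], event["visible"], user_id))
```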
It is probably first a kind of explode and then something like a transpose (from columns to rows). How can this be done?
For my final task I have to loop through all the numbered structs somehow, and I assume that exploding the nested structures makes this more efficient. If you have other suggestions, feel free to share them. I appreciate any good advice.