
I have now spent a couple of hours searching the net for any hint on how to explode a nested struct in an Apache Spark DataFrame using PySpark. For array and map types there is the function explode() in pyspark.sql.functions, but for a struct type no such function exists (it would be a nice feature, though).

This is my problem:

I have a DataFrame that looks the following:

newJSON = '{"level_instance_json":{"events":{"0":{"id":1,"visible":true},"1":{"id":2,"visible":true},"2":{"id":1,"visible":false},"3":{"id":2,"visible":false}}},"user_id":"a1"}'
newJSON2 = '{"level_instance_json":{"events":{"0":{"id":1,"visible":true},"1":{"id":2,"visible":true},"2":{"id":1,"visible":false},"3":{"id":2,"visible":false}}},"user_id":"b2"}'
dfJSON = spark.read.json(sc.parallelize([newJSON, newJSON2]))
dfJSON.printSchema()

(The printSchema() output shows the nested schema: level_instance_json is a struct containing an events struct with fields 0, 1, 2, 3, each of which is a struct with id and visible, plus a top-level user_id string column.)

You can clearly see the nested structure in the level_instance_json struct. What I want is to have all of these nested structures exploded so that each numbered struct (i.e. 0, 1, 2, 3) becomes its own row, with the number as an event_id column, and all of my variables end up on the same level:

event_id | id | visible | user_id
0          1    true      a1
1          2    true      a1
2          1    false     a1
3          2    false     a1
0          1    true      b2
1          2    true      b2
2          1    false     b2
3          2    false     b2         

It is probably first some kind of explode and then something like a transpose (from columns to rows). How can this be done?
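For reference, the closest I have come so far is to build an array of (event_id, event) structs by hand and explode that array. It produces the table above for this example, but the event ids are hard-coded, which feels wrong, and I am not sure it is the idiomatic way:

from pyspark.sql import functions as F

# Hard-coded for this example; ideally these would come from the schema.
event_ids = ["0", "1", "2", "3"]

# Build one struct per numbered event, wrap them in an array, then explode
# that array so every event becomes its own row.
events = F.array(*[
    F.struct(
        F.lit(e).alias("event_id"),
        F.col("level_instance_json.events")[e].alias("event")
    )
    for e in event_ids
])

flat = (dfJSON
        .select(F.explode(events).alias("e"), "user_id")
        .select("e.event_id", "e.event.id", "e.event.visible", "user_id"))
flat.show()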

For my final task I will have to loop through all the numbered structs somehow, and I assume that exploding the nested structure first makes this more efficient. If you have other suggestions, feel free to share them. I appreciate any good advice here.
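In case it helps, this is how I am currently pulling the numbered field names out of the schema, so that the loop (or the sketch above) does not have to hard-code 0 to 3. field_type is just a small helper I wrote; maybe there is a nicer way:

# Small helper: return the dataType of the field called `name` inside a StructType.
def field_type(struct_type, name):
    return next(f.dataType for f in struct_type.fields if f.name == name)

events_type = field_type(field_type(dfJSON.schema, "level_instance_json"), "events")
event_ids = [f.name for f in events_type.fields]   # ['0', '1', '2', '3']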

  • Please [don't upload pictures of code/data](https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-on-so-when-asking-a-question) when asking your question. Instead, please try to provide a [reproducible example](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-dataframe-examples) with a small sample of your dataframe and the desired output. – pault Sep 07 '18 at 15:09
  • @pault I have now edited my question and added a reproducible example and the desired output. Thanks for bringing this to my attention. – Simon Z. Sep 15 '18 at 08:00

0 Answers