I'd like to change the structure of a DataFrame in PySpark. The current schema is:

root
 |-- roster_id: long (nullable = true)
 |-- members: struct (nullable = true)
 |    |-- m10: struct (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- address: string (nullable = true)
 |    |    |-- hobby_1: string (nullable = true)
 |    |    |-- hobby_2: string (nullable = true)
 |    |-- m15: struct (nullable = true)
 |    |    |-- name: string (nullable = true)
 ~~~~~~~

I want to convert it to:

root
 |-- roster_id: long (nullable = true)
 |-- member_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- address: string (nullable = true)
 |-- hobby_1: string (nullable = true)
 |-- hobby_2: string (nullable = true)

But there are two problems.

・I do not know in advance which "members.X" keys exist.

・Some "members.X.X" fields (for example hobby_2) may be missing, depending on the member.

I think this is difficult. Is there a way to do it?

Please also tell me if PySpark is not suitable for this.

Example

Raw data:

{
  "roster_id": "abc",
  "members": {
    "m10": {
      "name": "John",
      "address": "Tokyo",
      "hobby_1": "Baseball",
      "hobby_2": "Tennis"
    },
    "m15": {
      "name": "Paul",
      "address": "NY",
      "hobby_1": "Music"
    }
  }
}

Desired output:

+---------+---------+----+-------+--------+-------+
|roster_id|member_id|name|address| hobby_1|hobby_2|
+---------+---------+----+-------+--------+-------+
|      abc|      m10|John|  Tokyo|Baseball| Tennis|
|      abc|      m15|Paul|     NY|   Music|   null|
+---------+---------+----+-------+--------+-------+
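
To make the intended transformation concrete, here is a minimal sketch of the per-record logic in plain Python. It is not a tested Spark job. It assumes each raw record has already been parsed into a dict, and that the attribute columns are the ones listed in the target schema. The names `flatten_roster` and `MEMBER_FIELDS` are hypothetical and used only for illustration.

```python
import json

# Assumption: the attribute columns are known from the target schema.
MEMBER_FIELDS = ["name", "address", "hobby_1", "hobby_2"]


def flatten_roster(record, fields=MEMBER_FIELDS):
    """Yield one (roster_id, member_id, *fields) tuple per member.

    Member keys (m10, m15, ...) are discovered from the record itself,
    and any attribute a member lacks becomes None.
    """
    members = record.get("members") or {}
    for member_id in sorted(members):
        attrs = members[member_id] or {}
        yield (record.get("roster_id"), member_id) + tuple(
            attrs.get(field) for field in fields
        )


raw = """
{
  "roster_id": "abc",
  "members": {
    "m10": {"name": "John", "address": "Tokyo",
            "hobby_1": "Baseball", "hobby_2": "Tennis"},
    "m15": {"name": "Paul", "address": "NY", "hobby_1": "Music"}
  }
}
"""

rows = list(flatten_roster(json.loads(raw)))
for row in rows:
    print(row)
```

Because the function handles one record at a time, it could in principle be passed to `RDD.flatMap` over an RDD of parsed records, with the result handed to `spark.createDataFrame` together with an explicit schema. That step is an untested assumption and is not shown here.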
  • Please take a look at [How to make good reproducible Apache Spark Dataframe examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-dataframe-examples). It's not clear what your data looks like and what you're trying to achieve, e.g. where does `member_id` come from, what does "may not be depending on member" mean, etc.? Also, do you need a separate row for each _member_ or something else? – Sergey Khudyakov Aug 30 '18 at 09:06
  • Thank you for the comment. member_id is "m10" in this example; I have also added a JSON example. At first I looked up the member keys and looped over each member, but that is a very heavy Spark process. I want to flatten and query everything at once. – Masaru Sasaki Aug 30 '18 at 10:23

0 Answers