Being a bit of a noob in Scala / Spark, I am stuck and would appreciate any help!
I am importing JSON data into a Spark DataFrame. In the process, I end up with a DataFrame that has the same nested structure as the JSON input.
My aim is to recursively flatten the entire DataFrame (including the innermost child attributes inside arrays / structs), using Scala.
Additionally, some child attributes may share the same name, so they need to be disambiguated as well.
A somewhat similar solution (same child attribute names under different parents) is shown here: https://stackoverflow.com/a/38460312/3228300
An example of what I am hoping to achieve is as follows:
{
  "id": "0001",
  "type": "donut",
  "name": "Cake",
  "ppu": 0.55,
  "batters": {
    "batter": [
      { "id": "1001", "type": "Regular" },
      { "id": "1002", "type": "Chocolate" },
      { "id": "1003", "type": "Blueberry" },
      { "id": "1004", "type": "Devil's Food" }
    ]
  },
  "topping": [
    { "id": "5001", "type": "None" },
    { "id": "5002", "type": "Glazed" },
    { "id": "5005", "type": "Sugar" },
    { "id": "5007", "type": "Powdered Sugar" },
    { "id": "5006", "type": "Chocolate with Sprinkles" },
    { "id": "5003", "type": "Chocolate" },
    { "id": "5004", "type": "Maple" }
  ]
}
The corresponding flattened output Spark DF structure would be:
{
  "id": "0001",
  "type": "donut",
  "name": "Cake",
  "ppu": 0.55,
  "batters_batter_id_0": "1001",
  "batters_batter_type_0": "Regular",
  "batters_batter_id_1": "1002",
  "batters_batter_type_1": "Chocolate",
  "batters_batter_id_2": "1003",
  "batters_batter_type_2": "Blueberry",
  "batters_batter_id_3": "1004",
  "batters_batter_type_3": "Devil's Food",
  "topping_id_0": "5001",
  "topping_type_0": "None",
  "topping_id_1": "5002",
  "topping_type_1": "Glazed",
  "topping_id_2": "5005",
  "topping_type_2": "Sugar",
  "topping_id_3": "5007",
  "topping_type_3": "Powdered Sugar",
  "topping_id_4": "5006",
  "topping_type_4": "Chocolate with Sprinkles",
  "topping_id_5": "5003",
  "topping_type_5": "Chocolate",
  "topping_id_6": "5004",
  "topping_type_6": "Maple"
}
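To make the intended naming scheme concrete, here is a minimal, Spark-free sketch of the recursive flattening I have in mind, operating on the plain Map / Seq structure a JSON parser would produce. The function name `flatten` and the `sep` parameter are my own placeholders, not part of any Spark API: nested object keys are joined with `_` (which also disambiguates identical child names under different parents), and array elements get their position appended as a suffix, matching `topping_id_0` above.

```scala
// Recursively flatten a parsed-JSON-like structure into a single-level map.
// Object keys are joined with `sep`; array element indices are appended as a
// suffix to every key produced by that element, e.g. "batters_batter_id_0".
def flatten(value: Any, prefix: String = "", sep: String = "_"): Map[String, Any] =
  value match {
    case m: Map[_, _] =>
      m.toSeq.flatMap { case (k, v) =>
        // Prefix child keys with the parent path to keep duplicate names apart.
        val key = if (prefix.isEmpty) k.toString else s"$prefix$sep$k"
        flatten(v, key, sep)
      }.toMap
    case xs: Seq[_] =>
      xs.zipWithIndex.flatMap { case (v, i) =>
        // Flatten each element, then tag its keys with the element index.
        flatten(v, prefix, sep).map { case (k, leaf) => (s"$k$sep$i", leaf) }
      }.toMap
    case leaf =>
      Map(prefix -> leaf)
  }
```

In Spark itself, the same recursion could be driven by `df.schema` (a `StructType`), descending into nested `StructType` / `ArrayType` fields and selecting `col(path).getItem(i).alias(flatName)` for array elements, but the naming logic would be the same as above.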
Not having worked much with Scala and Spark before, I am unsure how to proceed.
Lastly, I would be extremely thankful if someone could help with code for a general / schema-agnostic solution, as I need to apply it to a lot of different collections.
Thanks a lot :)