
Being new to Scala / Spark, I am a bit stuck and would appreciate any help!

I am importing JSON data into a Spark DataFrame. In the process, I end up with a DataFrame that has the same nested structure as the JSON input.

My aim is to flatten the entire DataFrame recursively (including the innermost child attributes in an array / dictionary), using Scala.

Additionally, there may be child attributes with the same names under different parents, so I need to differentiate them as well.

A somewhat similar solution (same child attribute names under different parents) is shown here: https://stackoverflow.com/a/38460312/3228300

An example of what I am hoping to achieve is as follows:

{
    "id": "0001",
    "type": "donut",
    "name": "Cake",
    "ppu": 0.55,
    "batters":
        {
            "batter":
                [
                    { "id": "1001", "type": "Regular" },
                    { "id": "1002", "type": "Chocolate" },
                    { "id": "1003", "type": "Blueberry" },
                    { "id": "1004", "type": "Devil's Food" }
                ]
        },
    "topping":
        [
            { "id": "5001", "type": "None" },
            { "id": "5002", "type": "Glazed" },
            { "id": "5005", "type": "Sugar" },
            { "id": "5007", "type": "Powdered Sugar" },
            { "id": "5006", "type": "Chocolate with Sprinkles" },
            { "id": "5003", "type": "Chocolate" },
            { "id": "5004", "type": "Maple" }
        ]
}

The corresponding flattened output Spark DF structure would be:

{
    "id": "0001",
    "type": "donut",
    "name": "Cake",
    "ppu": 0.55,
    "batters_batter_id_0": "1001", 
    "batters_batter_type_0": "Regular",
    "batters_batter_id_1": "1002", 
    "batters_batter_type_1": "Chocolate",
    "batters_batter_id_2": "1003", 
    "batters_batter_type_2": "Blueberry",
    "batters_batter_id_3": "1004", 
    "batters_batter_type_3": "Devil's Food",
    "topping_id_0": "5001",
    "topping_type_0": "None",
    "topping_id_1": "5002", 
    "topping_type_1": "Glazed",
    "topping_id_2": "5005", 
    "topping_type_2": "Sugar",
    "topping_id_3": "5007", 
    "topping_type_3": "Powdered Sugar",
    "topping_id_4": "5006", 
    "topping_type_4": "Chocolate with Sprinkles",
    "topping_id_5": "5003", 
    "topping_type_5": "Chocolate",
    "topping_id_6": "5004", 
    "topping_type_6": "Maple"
}

Not having worked much with Scala and Spark previously, I am unsure how to proceed.

Lastly, I would be extremely thankful if someone could help with code for a general / schema-agnostic solution, as I need to apply it to many different collections.

Thanks a lot :)

vsdaking

2 Answers


Here is one possible approach, which we used in one of our projects:

  1. Define a case class that maps a row of the flattened dataframe (note that `type` is a Scala keyword, so it must be escaped with backticks):

case class BattersTopics(id: String, `type`: String, ..., batters_batter_id_0: String, ..., topping_id_0: String)
  2. Map each row of the dataframe to the case class:

df.map(row => BattersTopics(id = row.getAs[String]("id"), ...,
   batters_batter_id_0 = row.getAs[String]("batters_batter_id_0"), ...))

  3. Collect to a list and make a Map[String, Any] from each row:

val rows = dataSet.collect().toList
rows.map(bt => Map(
  "id" -> bt.id,
  "type" -> bt.`type`,
  "batters" -> Map(
    "batter" -> List(Map("id" -> bt.batters_batter_id_0, "type" ->
      bt.batters_batter_type_0), ...) // same for the other ids and types
  ),
  "topping" -> List(Map("id" -> bt.topping_id_0, "type" -> bt.topping_type_0), ...) // same for the other ids and types
))
  4. Use Jackson to convert the Map[String, Any] to JSON.
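Step 4 could be sketched as follows, assuming the `jackson-module-scala` dependency is on the classpath (the variable name `rowsAsMaps` stands for the `List[Map[String, Any]]` built in step 3):

```scala
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

// Register the Scala module so Jackson understands Scala Maps and Lists
val mapper = new ObjectMapper()
mapper.registerModule(DefaultScalaModule)

// Serialise the reconstructed nested structure back to a JSON string
val json: String = mapper.writeValueAsString(rowsAsMaps)
```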
dumitru
  • Hi @dumitru, Firstly, thanks for helping out :) From what I see, you have generated the mapping for the existing schema to the output schema. However, my JSON file might have some additional dictionary elements from what I might currently consider (generated dynamically). Hence, need code for flattening recursively on unknown schema. e.g. https://stackoverflow.com/a/37473765/3228300 However, need to consider child attr. with same names as well. – vsdaking Jul 17 '17 at 16:13

Sample data, which contains all the different types of JSON elements (nested JSON map, JSON array, long, String, etc.):

{"name":"Akash","age":16,"watches":{"name":"Apple","models":["Apple Watch Series 5","Apple Watch Nike"]},"phones":[{"name":"Apple","models":["iphone X","iphone XR","iphone XS","iphone 11","iphone 11 Pro"]},{"name":"Samsung","models":["Galaxy Note10","Galaxy S10e","Galaxy S10"]},{"name":"Google","models":["Pixel 3","Pixel 3a"]}]}
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
 |-- phones: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- models: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |-- watches: struct (nullable = true)
 |    |-- models: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- name: string (nullable = true)

This sample data has both ArrayType and StructType (map) values in the JSON data.

We can write two switch conditions, one for each type, and repeat the process until the data flattens out into the required DataFrame.
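The two switch conditions described above could be sketched like this in Scala (a minimal, untested outline; `flattenDataFrame` is a hypothetical helper name, and `explode_outer` assumes Spark 2.2+). Nested field names are joined with `_`, so same-named children under different parents stay distinct:

```scala
import org.apache.spark.sql.{DataFrame, functions => F}
import org.apache.spark.sql.types.{ArrayType, StructType}

// Recursively flattens the first struct or array column found,
// then recurses until the schema contains only primitive columns.
def flattenDataFrame(df: DataFrame): DataFrame = {
  val fields = df.schema.fields
  fields.collectFirst {
    case f if f.dataType.isInstanceOf[StructType] => f
    case f if f.dataType.isInstanceOf[ArrayType]  => f
  } match {
    case None => df // nothing nested left
    case Some(field) =>
      field.dataType match {
        case st: StructType =>
          // Switch 1: promote each child of a struct to a
          // top-level column named parent_child
          val expanded = st.fieldNames.map(c =>
            F.col(s"`${field.name}`.`$c`").alias(s"${field.name}_$c"))
          val others = fields.filterNot(_.name == field.name)
                             .map(f => F.col(s"`${f.name}`"))
          flattenDataFrame(df.select(others ++ expanded: _*))
        case _: ArrayType =>
          // Switch 2: one row per array element, then re-visit the schema
          flattenDataFrame(
            df.withColumn(field.name, F.explode_outer(F.col(field.name))))
      }
  }
}
```

Note that `explode_outer` turns array elements into extra rows rather than the indexed `_0`, `_1`, ... columns shown in the question; producing indexed columns instead would require something like `posexplode` followed by a pivot on the position.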

https://medium.com/@ajpatel.bigdata/flatten-json-data-with-apache-spark-java-api-5f6a8e37596b

The linked article shows the corresponding Spark Java API solution.

AP-Big Data