
I am looking to dynamically flatten a Parquet file in Spark with Scala, and I am wondering what an efficient way to achieve this would be.

The Parquet file contains Array and Struct types nested at multiple depth levels. The schema can change in the future, so I cannot hard-code any attributes. The desired end result is a flattened delimited file.

Would a solution using flatMap and recursively exploding columns work?

Example Schema:

 |-- exCar: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- exCarOne: string (nullable = true)
 |    |    |-- exCarTwo: string (nullable = true)
 |    |    |-- exCarThree: string (nullable = true)
 |-- exProduct: string (nullable = true)
 |-- exName: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- exNameOne: string (nullable = true)
 |    |    |-- exNameTwo: string (nullable = true)
 |    |    |-- exNameThree: string (nullable = true)
 |    |    |-- exNameFour: string (nullable = true)
 |    |    |-- exNameCode: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- exNameCodeOne: string (nullable = true)
 |    |    |    |    |-- exNameCodeTwo: string (nullable = true)
 |    |    |-- exColor: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- exColorOne: string (nullable = true)
 |    |    |    |    |-- exColorTwo: string (nullable = true)
 |    |    |    |    |-- exWheelColor: array (nullable = true)
 |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |-- exWheelColorOne: string (nullable = true)
 |    |    |    |    |    |    |-- exWheelColorTwo: string (nullable = true)
 |    |    |    |    |    |    |-- exWheelColorThree: string (nullable = true)
 |    |    |-- exGlass: string (nullable = true)
 |-- exDetails: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- exBill: string (nullable = true)
 |    |    |-- exAccount: string (nullable = true)
 |    |    |-- exLoan: string (nullable = true)
 |    |    |-- exRate: string (nullable = true)

Desired output Schema:

 exCar.exCarOne
 exCar.exCarTwo
 exCar.exCarThree
 exProduct
 exName.exNameOne
 exName.exNameTwo
 exName.exNameThree
 exName.exNameFour
 exName.exNameCode.exNameCodeOne
 exName.exNameCode.exNameCodeTwo
 exName.exColor.exColorOne
 exName.exColor.exColorTwo
 exName.exColor.exWheelColor.exWheelColorOne
 exName.exColor.exWheelColor.exWheelColorTwo
 exName.exColor.exWheelColor.exWheelColorThree
 exName.exGlass
 exDetails.exBill
 exDetails.exAccount
 exDetails.exLoan
 exDetails.exRate
Defcon

1 Answer


There are two things that need to be done:

1) Explode the array columns from the outermost nested arrays to the ones lying inside: explode exName (giving you a lot of rows whose structs still contain exColor), then explode exColor, which gives you access to exWheelColor, and so on.

2) Project the nested struct fields into separate top-level columns (both steps are sketched below).
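A minimal, untested sketch of those two steps, assuming Spark 2.2+ for explode_outer; the function name flattenDataFrame, the dot-separated aliases, and the file paths are illustrative choices rather than anything from the original post:

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, explode_outer}
import org.apache.spark.sql.types.{ArrayType, StructType}

// Recursively flattens a DataFrame whose schema nests ArrayType and StructType
// at arbitrary depths: arrays are exploded (one output row per element) and
// struct fields are projected into top-level columns named parent.child.
def flattenDataFrame(df: DataFrame): DataFrame = {
  // Flattened column names contain dots, so references must be back-ticked.
  def ref(name: String): Column = col(s"`$name`")

  val fields      = df.schema.fields
  val arrayField  = fields.find(_.dataType.isInstanceOf[ArrayType])
  val structField = fields.find(_.dataType.isInstanceOf[StructType])

  (arrayField, structField) match {
    case (Some(f), _) =>
      // explode_outer keeps rows whose array is null or empty
      flattenDataFrame(df.withColumn(f.name, explode_outer(ref(f.name))))
    case (None, Some(f)) =>
      val projected = fields.toSeq.flatMap { x =>
        if (x.name == f.name)
          f.dataType.asInstanceOf[StructType].fieldNames.toSeq
            .map(c => ref(f.name).getField(c).as(s"${f.name}.$c"))
        else
          Seq(ref(x.name))
      }
      flattenDataFrame(df.select(projected: _*))
    case (None, None) => df
  }
}

// Usage sketch (spark is your SparkSession; paths are placeholders):
val flat = flattenDataFrame(spark.read.parquet("/path/to/input"))
flat.write.option("delimiter", "|").csv("/path/to/output")

Because the schema is read from the file at runtime, nothing here is hard-coded. Be aware, though, that exploding several independent top-level arrays (exCar, exName, exDetails here) multiplies rows, i.e. you get a cartesian product per input record; that is inherent to flattening this shape into a single delimited file, but it matters at scale.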

Elmar Macek
  • Right, step two is where I think the basis of this question lies. How does one do this without a "self-join"? For example, the exploded column exName.exNameOne as a dataframe joined with exName.exNameTwo, joined with exName.exNameCode.exNameCodeOne, and so on. This would create major scaling issues. – Defcon Mar 25 '19 at 16:21
  • If you already know which elements of the arrays you want to access, you can do so directly (a short sketch follows after these comments): https://stackoverflow.com/questions/34916038/sparksql-sql-syntax-for-nth-item-in-array – Elmar Macek Mar 25 '19 at 16:26
  • Elements/schema are dynamic, so a hard-coded solution is not an option. – Defcon Mar 25 '19 at 16:29
  • The column names you chose `exName.exColor.exWheelColor.exWheelColorOne/Two/Three` seem to me as if you know exactly what you want. Which, by the way, is the absolute requirement for transforming structured rows into rows of flat, atomic columns with ONE fixed schema. – Elmar Macek Mar 25 '19 at 16:32
  • The schema will be extracted from the Parquet file. For example, there could be a schema for v1.00 and, a week later, v1.01, so the requirement would somehow need to account for this. The file contains the schema at that particular point in time. – Defcon Mar 25 '19 at 17:49
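For reference, the direct access Elmar links to above would look roughly like this in Scala; it only works when the array positions and field names are known up front, which the asker rules out (df is assumed to be the DataFrame read from the Parquet file):

import org.apache.spark.sql.functions.col

// first element of exName, then its exNameOne field
df.select(col("exName").getItem(0).getField("exNameOne").as("exNameOne"))
// equivalent Spark SQL expression syntax
df.selectExpr("exName[0].exNameOne AS exNameOne")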