We want to break a nested data structure into separate entities using Spark and Scala. The schema looks like this:
root
|-- timestamp: string (nullable = true)
|-- contract: struct (nullable = true)
| |-- category: string (nullable = true)
| |-- contractId: array (nullable = true)
| |-- items: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- active: boolean (nullable = true)
| | | |-- itemId: string (nullable = true)
| | | |-- subItems: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- elementId: string (nullable = true)
We want to have contracts, items, and subItems as separate data collections. Each sub-entity should carry a reference to its parent, plus the top-level fields (here: timestamp) as audit fields.
Contracts:
- auditTimestamp
- category
- contractId
Items:
- auditTimestamp
- contractId (foreignKey)
- active
- itemId
SubItems:
- auditTimestamp
- itemId (foreignKey)
- elementId
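Hard-coded for this schema, the extraction we are after might look roughly like this (a sketch, assuming a DataFrame `df` with the schema above):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Contracts: top-level audit field plus the contract fields, minus the child array.
val contracts: DataFrame = df.select(
  col("timestamp").as("auditTimestamp"),
  col("contract.category"),
  col("contract.contractId")
)

// Items: explode contract.items, carrying the parent key and the audit field.
val items: DataFrame = df
  .select(
    col("timestamp").as("auditTimestamp"),
    col("contract.contractId"),
    explode(col("contract.items")).as("item")
  )
  .select(
    col("auditTimestamp"),
    col("contractId"),
    col("item.active"),
    col("item.itemId")
  )

// SubItems: explode twice, carrying itemId as the foreign key.
val subItems: DataFrame = df
  .select(
    col("timestamp").as("auditTimestamp"),
    explode(col("contract.items")).as("item")
  )
  .select(
    col("auditTimestamp"),
    col("item.itemId"),
    explode(col("item.subItems")).as("subItem")
  )
  .select(
    col("auditTimestamp"),
    col("itemId"),
    col("subItem.elementId")
  )
```

But this is exactly the hand-written, per-attribute configuration we want to avoid.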
We don't want to configure every attribute explicitly, but only the parent attribute to extract, the foreign key (reference), and what should NOT be extracted (e.g. a contract should not contain its items, an item should not contain its subItems).
We tried with dataframe.select("*").select(explode("contract.*"))
and the like, but we can't make it work.
Any ideas on how to do this elegantly are welcome.
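To make the question more concrete, here is a rough sketch of the kind of configurable helper we have in mind (names and signature are hypothetical; it reads the exploded struct's fields from the schema and drops the excluded ones, so only the array path, foreign key, and exclusions need to be configured):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType

// Hypothetical generic extractor for one nesting level:
//   arrayPath  - the array column to explode (e.g. "contract.items")
//   foreignKey - the parent column to carry along as a reference (optional)
//   exclude    - child fields NOT to extract (e.g. the nested child arrays)
def extractEntity(df: DataFrame,
                  arrayPath: String,
                  foreignKey: Option[String],
                  exclude: Set[String]): DataFrame = {
  // Alias the foreign key by its leaf name (e.g. "contract.contractId" -> "contractId").
  val fkCols = foreignKey.map(p => col(p).as(p.split('.').last)).toSeq
  val exploded = df.select(
    (col("timestamp").as("auditTimestamp") +: fkCols) :+
      explode(col(arrayPath)).as("entity"): _*
  )
  // Expand the exploded struct via its schema, skipping excluded fields.
  val childCols = exploded.schema("entity").dataType
    .asInstanceOf[StructType]
    .fieldNames
    .filterNot(exclude.contains)
    .map(f => col(s"entity.$f"))
  exploded.select(
    (col("auditTimestamp") +: foreignKey.map(p => col(p.split('.').last)).toSeq) ++
      childCols: _*
  )
}

// e.g. val items = extractEntity(df, "contract.items", Some("contract.contractId"), Set("subItems"))
```

This only handles one level per call (deeper levels like subItems would need the items array exploded first, since explode cannot reach through two nested arrays in one step), and we are not sure it is the cleanest approach.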
Best, Alex