Existing Schema of the DF:
|-- col1: string (nullable = true)
|-- col2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- col2_1: string (nullable = true)
| | |-- col2_2: string (nullable = true)
Example data for the existing Schema:
col1 col2
A [[0,2],[1,3]]
B [[1,5]]
C [[5,9],[4,6],[2,6]]
Required Schema:
|-- col1: timestamp (nullable = true)
|-- col2_1: string (nullable = true)
|-- col2_2: string (nullable = true)
Example data for required Schema:
col1 col2_1 col2_2
A 0 2
A 1 3
B 1 5
C 5 9
C 4 6
C 2 6
Code:
var df_flattened = df.select($"*", explode($"col2").as("flat")).select($"*",$"flat.col2_1",$"flat.col2_2").drop("col2")
I am not getting any errors with the code. but its missing values from the original DF, where the distinct(col1) is ~20000 in the original and it changes to ~6000 after the flattening.
Any suggestions on the mistake.