1

Existing Schema of the DF:

|-- col1: string (nullable = true)
|-- col2: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- col2_1: string (nullable = true)
|    |    |-- col2_2: string (nullable = true)

Example data for the existing Schema:

 col1         col2    
  A       [[0,2],[1,3]]
  B       [[1,5]] 
  C       [[5,9],[4,6],[2,6]] 

Required Schema:

|-- col1: timestamp (nullable = true)
|-- col2_1: string (nullable = true)
|-- col2_2: string (nullable = true)

Example data for required Schema:

 col1          col2_1    col2_2    
  A            0           2
  A            1           3
  B            1           5
  C            5           9
  C            4           6
  C            2           6

Code:

var df_flattened = df.select($"*", explode($"col2").as("flat")).select($"*",$"flat.col2_1",$"flat.col2_2").drop("col2")

I am not getting any errors with the code. but its missing values from the original DF, where the distinct(col1) is ~20000 in the original and it changes to ~6000 after the flattening.

Any suggestions on the mistake.

data_person
  • 4,194
  • 7
  • 40
  • 75

1 Answers1

0

explode() will not emit any rows where the array being exploded is null. Therefore, you should use explode_outer() instead.

Sim
  • 13,147
  • 9
  • 66
  • 95