Explode in Spark missing records

Question

Existing Schema of the DF:

|-- col1: string (nullable = true)
|-- col2: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- col2_1: string (nullable = true)
|    |    |-- col2_2: string (nullable = true)

Example data for the existing Schema:

 col1         col2    
  A       [[0,2],[1,3]]
  B       [[1,5]] 
  C       [[5,9],[4,6],[2,6]]

Required Schema:

|-- col1: timestamp (nullable = true)
|-- col2_1: string (nullable = true)
|-- col2_2: string (nullable = true)

Example data for required Schema:

 col1          col2_1    col2_2    
  A            0           2
  A            1           3
  B            1           5
  C            5           9
  C            4           6
  C            2           6

Code:

var df_flattened = df.select($"*", explode($"col2").as("flat")).select($"*",$"flat.col2_1",$"flat.col2_2").drop("col2")

I am not getting any errors with the code. but its missing values from the original DF, where the distinct(col1) is ~20000 in the original and it changes to ~6000 after the flattening.

Any suggestions on the mistake.

@ollik1 No it cannot be empty. But it can have variable size like in the example above. — data_person, Mar 15 '19 at 18:25
@ollik1 You are right. the value of col2 was null in some cases and those records were missing. — data_person, Mar 15 '19 at 19:19

score 0 · Accepted Answer · answered Mar 17 '19 at 21:58

0

explode() will not emit any rows where the array being exploded is null. Therefore, you should use explode_outer() instead.

answered Mar 17 '19 at 21:58

Sim

13,147
9
66
95

Explode in Spark missing records

1 Answers1