
I am having trouble finding a solution in Python that allows me to keep null rows while flattening a nested schema. I have seen solutions, but the functions end up either using explode_outer, which is only available in versions greater than 2.1.1, or they remove nulls, which is not what I want.

The following is the resource that I have read and attempted, but it does not work: Spark sql how to explode without losing null values

1 Answer


Suppose that your dataframe looks like this

from pyspark.sql.types import StructType, StructField, StringType, ArrayType, IntegerType

df = spark.createDataFrame(data=[['Aaron', [10, 20]],
                                 ['Bob', []],
                                 ['Charlie', [15]],
                                 ['Dave', None]],
                           schema=StructType([StructField('user', StringType(), True),
                                              StructField('value', ArrayType(IntegerType(), True), True)]))
df.show()

+-------+--------+
|   user|   value|
+-------+--------+
|  Aaron|[10, 20]|
|    Bob|      []|
|Charlie|    [15]|
|   Dave|    null|
+-------+--------+

Using the size function on an empty array returns 0, and on a null value it returns -1, so both cases can be told apart from a non-empty list.
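As a quick check (a minimal illustration on the same df; the value_size alias is only there for display), the sizes come out like this:

import pyspark.sql.functions as F

df.select('user', F.size('value').alias('value_size')).show()

+-------+----------+
|   user|value_size|
+-------+----------+
|  Aaron|         2|
|    Bob|         0|
|Charlie|         1|
|   Dave|        -1|
+-------+----------+

So whenever the size is less than 1, you can substitute an array holding a single null before exploding (make sure to cast it to the correct data type):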

import pyspark.sql.functions as F

# when the array is empty or null, replace it with [null] so explode keeps the row
df = df.withColumn('value', F.explode(
    F.when(F.size('value') < 1, F.array(F.lit(None).cast('int')))
     .otherwise(F.col('value'))))
df.show()

+-------+-----+
|   user|value|
+-------+-----+
|  Aaron|   10|
|  Aaron|   20|
|    Bob| null|
|Charlie|   15|
|   Dave| null|
+-------+-----+
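For comparison only, on Spark versions where explode_outer is available (which the question rules out), the same output comes from a single call, since explode_outer keeps rows whose array is empty or null by emitting a null value:

df = df.withColumn('value', F.explode_outer('value'))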

AdibP