0

I've Spark DataFrame with a Array column (StringType)

Sample DataFrame:

df = spark.createDataFrame([
  [None],   
  [[]],   
  [['foo']] 
]).toDF("a")

Current Output:

+-----+
|    a|
+-----+
| null|
|   []|
|[foo]|
+-----+

Desired Output:

+-----+
|    a|
+-----+
|   []|
|   []|
|[foo]|
+-----+

I need to convert the Null values to an empty Array to concat with another array column.

Already tried this, but it's not working

df.withColumn("a",F.coalesce(F.col("a"),F.from_json(F.lit("[]"), T.ArrayType(T.StringType()))))

Convert null values to empty array in Spark DataFrame

John Doe
  • 9,843
  • 13
  • 42
  • 73

1 Answers1

3

Use array function.

df = spark.createDataFrame([
  [None],   
  [[]],   
  [['foo']] 
]).toDF("a")

import pyspark.sql.functions as F

df.withColumn('a', F.coalesce(F.col('a'), F.array(F.lit(None)))).show(10, False)
+-----+
|a    |
+-----+
|[]   |
|[]   |
|[foo]|
+-----+

The result is now array(string), so there is no null value. Please check the results.

temp = spark.sql("SELECT a FROM table WHERE a is NULL")
temp.show(10, False)
temp = spark.sql("SELECT a FROM table WHERE a = array(NULL)")
temp.show(10, False)
temp = spark.sql("SELECT a FROM table")
temp.show(10, False)


+---+
|a  |
+---+
+---+

+---+
|a  |
+---+
|[] |
+---+

+-----+
|a    |
+-----+
|[]   |
|[]   |
|[foo]|
+-----+
Lamanus
  • 12,898
  • 4
  • 21
  • 47
  • I still have Null values. Column is `ArrayType(StringType)` Does that make sense? – John Doe Aug 14 '20 at 12:04
  • what version of spark, python? – Lamanus Aug 14 '20 at 12:07
  • Spark 3 and Python 3.6 – John Doe Aug 14 '20 at 12:10
  • Even my code is for spark 3.0.0 and python 3.8.5. I think the spark version is the matter, so it should work. I have added my full code. – Lamanus Aug 14 '20 at 12:15
  • It's a bit weird. If I query in Superset with a `WHERE IS NULL` then I don't have any Null values for any column. When I query without the `WHERE` clause (with the respective columns) I see Null values. – John Doe Aug 14 '20 at 12:22
  • schema: `col_a: array (nullable = false) --element: string (containsNull = true)` `col_b: array (nullable = false) --element: string (containsNull = true)` `col_c: array (nullable = false) --element: string (containsNull = true)` – John Doe Aug 14 '20 at 12:27
  • I've 3 columns to concatenate; `col_a` result is same as your example. For `col_b `and `col_c `the result for query 2 is equal to query 1. `col_aa` is converted from `varchar` to `Array(String)` and `col_b`& `col_c` were already of the type `Array(String)` . After the concatenate the new column is sometimes `Null` while there's a value in one of the cols. – John Doe Aug 14 '20 at 14:33
  • I need the duplicates and i've 3 columns. It's the input for a UDF. – John Doe Aug 14 '20 at 14:49
  • Can you update your question with reproducible data? I cannot imagine how it happen. – Lamanus Aug 14 '20 at 14:50
  • I checked the new column the same as `col_a`..etc. and there are no hits for query 1. I'll try to export the results to csv to check – John Doe Aug 14 '20 at 14:55
  • 1
    Results are not displayed properly (Presto/Superset) . After export and verification of the CSV everything appears to be correct. Thank you very much for your patience and help !! – John Doe Aug 14 '20 at 15:54